summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-07-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Make sure we don't go over the maximum jump stack boundary, from Taehee Yoo. 2) Missing rcu_barrier() in hash and rbtree sets, also from Taehee. 3) Missing check to nul-node in rbtree timeout routine, from Taehee. 4) Use dev->name from flowtable to fix a memleak, from Florian. 5) Oneliner to free flowtable object on removal, from Florian. 6) Memleak in chain rename transaction, again from Florian. 7) Don't allow two chains to use the same name in the same transaction, from Florian. 8) handle DCCP SYNC/SYNCACK as invalid, this triggers an uninitialized timer in conntrack reported by syzbot, from Florian. 9) Fix leak in case netlink_dump_start() fails, from Florian. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-24Merge tag 'mac80211-for-davem-2018-07-24' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Only a few fixes: * always keep regulatory user hint * add missing break statement in station flags parsing * fix non-linear SKBs in port-control-over-nl80211 * reconfigure VLAN stations during HW restart ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-24MAINTAINERS: Add Naveen N. Rao as kprobes co-maintainerAnanth N Mavinakayanahalli
Naveen has been contributing consistently reviewing and hardening kprobes for some time now. I have not been able to do the same due to other commitments. Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: akpm@linux-foundation.org Cc: mhiramat@kernel.org Link: http://lkml.kernel.org/r/153180735790.1914.15547706781664285286.stgit@thinktux Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-24i2c: imx: use open drain for recovery GPIOWolfram Sang
I2C is open drain, so request the GPIO accordingly, even if pinmux did set it up correctly for in-kernel users in this case. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2018-07-24i2c: rcar: handle RXDMA HW behaviour on Gen3Wolfram Sang
On Gen3, we can only do RXDMA once per transfer reliably. For that, we must reset the device, then we can have RXDMA once. This patch implements this. When there is no reset controller or the reset fails, RXDMA will be blocked completely. Otherwise, it will be disabled after the first RXDMA transfer. Based on a commit from the BSP by Hiromitsu Yamasaki, yet completely refactored to handle multiple read messages within one transfer. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Cc: stable@kernel.org
2018-07-24nvme: if_ready checks to fail io to deleting controllerJames Smart
The revised if_ready checks skipped over the case of returning error when the controller is being deleted. Instead it was returning BUSY, which caused the ios to retry, which caused the ns delete to hang waiting for the ios to drain. Stack trace of hang looks like: kworker/u64:2 D 0 74 2 0x80000000 Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core] Call Trace: ? __schedule+0x26d/0x820 schedule+0x32/0x80 blk_mq_freeze_queue_wait+0x36/0x80 ? remove_wait_queue+0x60/0x60 blk_cleanup_queue+0x72/0x160 nvme_ns_remove+0x106/0x140 [nvme_core] nvme_remove_namespaces+0x7e/0xa0 [nvme_core] nvme_delete_ctrl_work+0x4d/0x80 [nvme_core] process_one_work+0x160/0x350 worker_thread+0x1c3/0x3d0 kthread+0xf5/0x130 ? process_one_work+0x350/0x350 ? kthread_bind+0x10/0x10 ret_from_fork+0x1f/0x30 Extend nvmf_fail_nonready_command() to supply the controller pointer so that the controller state can be looked at. Fail any io to a controller that is deleting. Fixes: 3bc32bb1186c ("nvme-fabrics: refactor queue ready check") Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready") Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Ewan D. Milne <emilne@redhat.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com>
2018-07-24nvmet-fc: fix target sgl list on large transfersJames Smart
The existing code to carve up the sg list expected an sg element-per-page which can be very incorrect with iommu's remapping multiple memory pages to fewer bus addresses. To hit this error required a large io payload (greater than 256k) and a system that maps on a per-page basis. It's possible that large ios could get by fine if the system condensed the sgl list into the first 64 elements. This patch corrects the sg list handling by specifically walking the sg list element by element and attempting to divide the transfer up on a per-sg element boundary. While doing so, it still tries to keep sequences under 256k, but will exceed that rule if a single sg element is larger than 256k. Fixes: 48fa362b6c3f ("nvmet-fc: simplify sg list handling") Cc: <stable@vger.kernel.org> # 4.14 Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24cpufreq: qcom-kryo: add NULL entry to the end of_device_id arrayYueHaibing
Make sure of_device_id tables are NULL terminated. Found by coccinelle spatch "misc/of_table.cocci" Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Ilia Lin <ilia.lin@kernel.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-07-24x86/entry/64: Remove %ebx handling from error_entry/exitAndy Lutomirski
error_entry and error_exit communicate the user vs. kernel status of the frame using %ebx. This is unnecessary -- the information is in regs->cs. Just use regs->cs. This makes error_entry simpler and makes error_exit more robust. It also fixes a nasty bug. Before all the Spectre nonsense, the xen_failsafe_callback entry point returned like this: ALLOC_PT_GPREGS_ON_STACK SAVE_C_REGS SAVE_EXTRA_REGS ENCODE_FRAME_POINTER jmp error_exit And it did not go through error_entry. This was bogus: RBX contained garbage, and error_exit expected a flag in RBX. Fortunately, it generally contained *nonzero* garbage, so the correct code path was used. As part of the Spectre fixes, code was added to clear RBX to mitigate certain speculation attacks. Now, depending on kernel configuration, RBX got zeroed and, when running some Wine workloads, the kernel crashes. This was introduced by: commit 3ac6d8c787b8 ("x86/entry/64: Clear registers for exceptions/interrupts, to reduce speculation attack surface") With this patch applied, RBX is no longer needed as a flag, and the problem goes away. I suspect that malicious userspace could use this bug to crash the kernel even without the offending patch applied, though. [ Historical note: I wrote this patch as a cleanup before I was aware of the bug it fixed. ] [ Note to stable maintainers: this should probably get applied to all kernels. If you're nervous about that, a more conservative fix to add xorl %ebx,%ebx; incl %ebx before the jump to error_exit should also fix the problem. ] Reported-and-tested-by: M. Vefa Bicakci <m.v.b@runbox.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Cc: xen-devel@lists.xenproject.org Fixes: 3ac6d8c787b8 ("x86/entry/64: Clear registers for exceptions/interrupts, to reduce speculation attack surface") Link: http://lkml.kernel.org/r/b5010a090d3586b2d6e06c7ad3ec5542d1241c45.1532282627.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-24x86/apic: Future-proof the TSC_DEADLINE quirk for SKXLen Brown
All SKX with stepping higher than 4 support the TSC_DEADLINE, no matter the microcode version. Without this patch, upcoming SKX steppings will not be able to use their TSC_DEADLINE timer. Signed-off-by: Len Brown <len.brown@intel.com> Cc: <stable@kernel.org> # v4.14+ Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 616dd5872e ("x86/apic: Update TSC_DEADLINE quirk with additional SKX stepping") Link: http://lkml.kernel.org/r/d0c7129e509660be9ec6b233284b8d42d90659e8.1532207856.git.len.brown@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-24perf/x86/amd/ibs: Don't access non-started eventThomas Gleixner
Paul Menzel reported the following bug: > Enabling the undefined behavior sanitizer and building GNU/Linux 4.18-rc5+ > (with some unrelated commits) with GCC 8.1.0 from Debian Sid/unstable, the > warning below is shown. > > > [ 2.111913] > > ================================================================================ > > [ 2.111917] UBSAN: Undefined behaviour in arch/x86/events/amd/ibs.c:582:24 > > [ 2.111919] member access within null pointer of type 'struct perf_event' > > [ 2.111926] CPU: 0 PID: 144 Comm: udevadm Not tainted 4.18.0-rc5-00316-g4864b68cedf2 #104 > > [ 2.111928] Hardware name: ASROCK E350M1/E350M1, BIOS TIMELESS 01/01/1970 > > [ 2.111930] Call Trace: > > [ 2.111943] dump_stack+0x55/0x89 > > [ 2.111949] ubsan_epilogue+0xb/0x33 > > [ 2.111953] handle_null_ptr_deref+0x7f/0x90 > > [ 2.111958] __ubsan_handle_type_mismatch_v1+0x55/0x60 > > [ 2.111964] perf_ibs_handle_irq+0x596/0x620 The code dereferences event before checking the STARTED bit. Patch below should cure the issue. The warning should not trigger, if I analyzed the thing correctly. (And Paul's testing confirms this.) Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Menzel <pmenzel+linux-x86@molgen.mpg.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807200958390.1580@nanos.tec.linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-24mac80211: restrict delayed tailroom needed decrementManikanta Pubbisetty
As explained in ieee80211_delayed_tailroom_dec(), during roam, keys of the old AP will be destroyed and new keys will be installed. Deletion of the old key causes crypto_tx_tailroom_needed_cnt to go from 1 to 0 and the new key installation causes a transition from 0 to 1. Whenever crypto_tx_tailroom_needed_cnt transitions from 0 to 1, we invoke synchronize_net(); the reason for doing this is to avoid a race in the TX path as explained in increment_tailroom_need_count(). This synchronize_net() operation can be slow and can affect the station roam time. To avoid this, decrementing the crypto_tx_tailroom_needed_cnt is delayed for a while so that upon installation of new key the transition would be from 1 to 2 instead of 0 to 1 and thereby improving the roam time. This is all correct for a STA iftype, but deferring the tailroom_needed decrement for other iftypes may be unnecessary. For example, let's consider the case of a 4-addr client connecting to an AP for which AP_VLAN interface is also created, let the initial value for tailroom_needed on the AP be 1. * 4-addr client connects to the AP (AP: tailroom_needed = 1) * AP will clear old keys, delay decrement of tailroom_needed count * AP_VLAN is created, it takes the tailroom count from master (AP_VLAN: tailroom_needed = 1, AP: tailroom_needed = 1) * Install new key for the station, assume key is plumbed in the HW, there won't be any change in tailroom_needed count on AP iface * Delayed decrement of tailroom_needed count on AP (AP: tailroom_needed = 0, AP_VLAN: tailroom_needed = 1) Because of the delayed decrement on AP iface, tailroom_needed count goes out of sync between AP(master iface) and AP_VLAN(slave iface) and there would be unnecessary tailroom created for the packets going through AP_VLAN iface. Also, WARN_ONs were observed while trying to bring down the AP_VLAN interface: (warn_slowpath_common) (warn_slowpath_null+0x18/0x20) (warn_slowpath_null) (ieee80211_free_keys+0x114/0x1e4) (ieee80211_free_keys) (ieee80211_del_virtual_monitor+0x51c/0x850) (ieee80211_del_virtual_monitor) (ieee80211_stop+0x30/0x3c) (ieee80211_stop) (__dev_close_many+0x94/0xb8) (__dev_close_many) (dev_close_many+0x5c/0xc8) Restricting delayed decrement to station interface alone fixes the problem and it makes sense to do so because delayed decrement is done to improve roam time which is applicable only for client devices. Signed-off-by: Manikanta Pubbisetty <mpubbise@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-07-24wireless/lib80211: Convert from ahash to shashKees Cook
In preparing to remove all stack VLA usage from the kernel[1], this removes the discouraged use of AHASH_REQUEST_ON_STACK in favor of the smaller SHASH_DESC_ON_STACK by converting from ahash-wrapped-shash to direct shash. The stack allocation will be made a fixed size in a later patch to the crypto subsystem. [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-07-24cfg80211: never ignore user regulatory hintAmar Singhal
Currently user regulatory hint is ignored if all wiphys in the system are self managed. But the hint is not ignored if there is no wiphy in the system. This affects the global regulatory setting. Global regulatory setting needs to be maintained so that it can be applied to a new wiphy entering the system. Therefore, do not ignore user regulatory setting even if all wiphys in the system are self managed. Signed-off-by: Amar Singhal <asinghal@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-07-24s390: disable gcc pluginsMartin Schwidefsky
The s390 build currently fails with the latent entropy plugin: arch/s390/kernel/als.o: In function `verify_facilities': als.c:(.init.text+0x24): undefined reference to `latent_entropy' als.c:(.init.text+0xae): undefined reference to `latent_entropy' make[3]: *** [arch/s390/boot/compressed/vmlinux] Error 1 make[2]: *** [arch/s390/boot/compressed/vmlinux] Error 2 make[1]: *** [bzImage] Error 2 This will be fixed with the early boot rework from Vasily, which is planned for the 4.19 merge window. For 4.18 the simplest solution is to disable the gcc plugins and reenable them after the early boot rework is upstream. Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2018-07-23Merge tag 'wireless-drivers-next-for-davem-2018-07-23' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next Kalle Valo says: ==================== wireless-drivers-next patches for 4.19 The first set of patches for 4.19. Only smaller features and bug fixes, not really anything major. Also included are changes to include/linux/bitfield.h, we agreed with Johannes that it makes sense to apply them via wireless-drivers-next. Major changes: ath10k * support channel 173 * fix spectral scan for QCA9984 and QCA9888 chipsets ath6kl * add support for Dell Wireless 1537 ti wlcore * add support for runtime PM * enable runtime PM autosuspend support qtnfmac * support changing MAC address * enable source MAC address randomization support libertas * fix suspend and resume for SDIO cards mt76 * add software DFS radar pattern detector for mt76x2 based devices ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23sock: fix sg page frag coalescing in sk_alloc_sgDaniel Borkmann
Current sg coalescing logic in sk_alloc_sg() (latter is used by tls and sockmap) is not quite correct in that we do fetch the previous sg entry, however the subsequent check whether the refilled page frag from the socket is still the same as from the last entry with prior offset and length matching the start of the current buffer is comparing always the first sg list entry instead of the prior one. Fixes: 3c4d7559159b ("tls: kernel TLS support") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23Merge branch 'rds-ipv6'David S. Miller
Ka-Cheong Poon says: ==================== rds: IPv6 support This patch set adds IPv6 support to the kernel RDS and related modules. Existing RDS apps using IPv4 address continue to run without any problem. New RDS apps which want to use IPv6 address can do so by passing the address in struct sockaddr_in6 to bind(), connect() or sendmsg(). And those apps also need to use the new IPv6 equivalents of some of the existing socket options as the existing options use a 32 bit integer to store IP address. All RDS code now use struct in6_addr to store IP address. IPv4 address is stored as an IPv4 mapped address. Header file changes There are many data structures (RDS socket options) used by RDS apps which use a 32 bit integer to store IP address. To support IPv6, struct in6_addr needs to be used. To ensure backward compatibility, a new data structure is introduced for each of those data structures which use a 32 bit integer to represent an IP address. And new socket options are introduced to use those new structures. This means that existing apps should work without a problem with the new RDS module. For apps which want to use IPv6, those new data structures and socket options can be used. IPv4 mapped address is used to represent IPv4 address in the new data structures. Internally, all RDS data structures which contain an IP address are changed to use struct in6_addr to store the address. IPv4 address is stored as an IPv4 mapped address. All the functions which take an IP address as argument are also changed to use struct in6_addr. RDS/RDMA/IB uses a private data (struct rds_ib_connect_private) exchange between endpoints at RDS connection establishment time to support RDMA. This private data exchange uses a 32 bit integer to represent an IP address. This needs to be changed in order to support IPv6. A new private data struct rds6_ib_connect_private is introduced to handle this. To ensure backward compatibility, an IPv6 capable RDS stack uses another RDMA listener port (RDS_CM_PORT) to accept IPv6 connection. And it continues to use the original RDS_PORT for IPv4 RDS connections. When it needs to communicate with an IPv6 peer, it uses the RDS_TCP_PORT to send the connection set up request. RDS/TCP changes TCP related code is changed to support IPv6. Note that only an IPv6 TCP listener on port RDS_TCP_PORT is created as it can accept both IPv4 and IPv6 connection requests. IB/RDMA changes The initial private data exchange between IB endpoints using RDMA is changed to support IPv6 address instead, if the peer address is IPv6. To ensure backward compatibility, annother RDMA listener port (RDS_CM_PORT) is used to accept IPv6 connection. An IPv6 capable RDS module continues to use the original RDS_PORT for IPv4 RDS connections. When it needs to communicate with an IPv6 peer, it uses the RDS_CM_PORT to send the connection set up request. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23rds: Extend RDS API for IPv6 supportKa-Cheong Poon
There are many data structures (RDS socket options) used by RDS apps which use a 32 bit integer to store IP address. To support IPv6, struct in6_addr needs to be used. To ensure backward compatibility, a new data structure is introduced for each of those data structures which use a 32 bit integer to represent an IP address. And new socket options are introduced to use those new structures. This means that existing apps should work without a problem with the new RDS module. For apps which want to use IPv6, those new data structures and socket options can be used. IPv4 mapped address is used to represent IPv4 address in the new data structures. v4: Revert changes to SO_RDS_TRANSPORT Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23rds: Enable RDS IPv6 supportKa-Cheong Poon
This patch enables RDS to use IPv6 addresses. For RDS/TCP, the listener is now an IPv6 endpoint which accepts both IPv4 and IPv6 connection requests. RDS/RDMA/IB uses a private data (struct rds_ib_connect_private) exchange between endpoints at RDS connection establishment time to support RDMA. This private data exchange uses a 32 bit integer to represent an IP address. This needs to be changed in order to support IPv6. A new private data struct rds6_ib_connect_private is introduced to handle this. To ensure backward compatibility, an IPv6 capable RDS stack uses another RDMA listener port (RDS_CM_PORT) to accept IPv6 connection. And it continues to use the original RDS_PORT for IPv4 RDS connections. When it needs to communicate with an IPv6 peer, it uses the RDS_CM_PORT to send the connection set up request. v5: Fixed syntax problem (David Miller). v4: Changed port history comments in rds.h (Sowmini Varadhan). v3: Added support to set up IPv4 connection using mapped address (David Miller). Added support to set up connection between link local and non-link addresses. Various review comments from Santosh Shilimkar and Sowmini Varadhan. v2: Fixed bound and peer address scope mismatched issue. Added back rds_connect() IPv6 changes. Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23rds: Changing IP address internal representation to struct in6_addrKa-Cheong Poon
This patch changes the internal representation of an IP address to use struct in6_addr. IPv4 address is stored as an IPv4 mapped address. All the functions which take an IP address as argument are also changed to use struct in6_addr. But RDS socket layer is not modified such that it still does not accept IPv6 address from an application. And RDS layer does not accept nor initiate IPv6 connections. v2: Fixed sparse warnings. Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23Merge branch 'sched-introduce-chain-templates-support-with-offloading-to-mlxsw'David S. Miller
Jiri Pirko says: ==================== sched: introduce chain templates support with offloading to mlxsw For the TC clsact offload these days, some of HW drivers need to hold a magic ball. The reason is, with the first inserted rule inside HW they need to guess what fields will be used for the matching. If later on this guess proves to be wrong and user adds a filter with a different field to match, there's a problem. Mlxsw resolves it now with couple of patterns. Those try to cover as many match fields as possible. This aproach is far from optimal, both performance-wise and scale-wise. Also, there is a combination of filters that in certain order won't succeed. Most of the time, when user inserts filters in chain, he knows right away how the filters are going to look like - what type and option will they have. For example, he knows that he will only insert filters of type flower matching destination IP address. He can specify a template that would cover all the filters in the chain. This patchset is providing the possibility to user to provide such template to kernel and propagate it all the way down to device drivers. See the examples below. Create dummy device with clsact first: There is no chain present by by default: Add chain number 11 by explicit command: chain parent ffff: chain 11 Add filter to chain number 12 which does not exist. That will create implicit chain 12: chain parent ffff: chain 11 chain parent ffff: chain 12 Delete both chains: Add a chain with template of type flower allowing to insert rules matching on last 2 bytes of destination mac address: The chain with template is now showed in the list: chain parent ffff: flower chain 0 dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff eth_type ipv4 Add another chain (number 22) with template: chain parent ffff: flower chain 0 dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff eth_type ipv4 chain parent ffff: flower chain 22 eth_type ipv4 dst_ip 0.0.0.0/16 Add a filter that fits the template: Addition of filters that does not fit the template would fail: Error: cls_flower: Mask does not fit the template. We have an error talking to the kernel, -1 Error: cls_flower: Mask does not fit the template. We have an error talking to the kernel, -1 Additions of filters to chain 22: Error: cls_flower: Mask does not fit the template. We have an error talking to the kernel, -1 Error: cls_flower: Mask does not fit the template. We have an error talking to the kernel, -1 --- v3->v4: - patch 2: - new patch - patch 3: - new patch, derived from the previous v3 chaintemplate obj patch - patch 4: - only templates part as chains creation/deletion is now a separate patch - don't pass template priv as arg of "change" op - patch 6: - rebased on top of flower cvlan patch and ip tos/ttl patch - patch 7: - templave priv is no longer passed as an arg to "change" op - patch 11: - split from the originally single patch - patch 12: - split from the originally single patch v2->v3: - patch 7: - rebase on top of the reoffload patchset - patch 8: - rebase on top of the reoffload patchset v1->v2: - patch 8: - remove leftover extack arg in fl_hw_create_tmplt() ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23selftests: forwarding: add tests for TC chain templatesJiri Pirko
Add basic sanity tests for TC chain templates. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23selftests: forwarding: add tests for TC chains creation adn destructionJiri Pirko
Add basic sanity tests for TC chains. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23selftests: forwarding: move shblock tc support check to a separate helperJiri Pirko
The shared block support is only needed for tc_shblock.sh. No need to require that for other test. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23mlxsw: spectrum: Implement chain template hintingJiri Pirko
Since cld_flower provides information about the filter template for specific chain, use this information in order to prepare a region. Use the template to find out what elements are going to be used and pass that down to mlxsw_sp_acl_tcam_group_add(). Later on, when the first filter is inserted, the mlxsw_sp_acl_tcam_group_use_patterns() function would use this element usage information instead of looking up a pattern. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: sched: cls_flower: propagate chain teplate creation and destruction to ↵Jiri Pirko
drivers Introduce a couple of flower offload commands in order to propagate template creation/destruction events down to device drivers. Drivers may use this information to prepare HW in an optimal way for future filter insertions. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: sched: cls_flower: implement chain templatesJiri Pirko
Use the previously introduced template extension and implement callback to create, destroy and dump chain template. The existing parsing and dumping functions are re-used. Also, check if newly added filters fit the template if it is set. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: sched: cls_flower: change fl_init_dissector to accept mask and dissectorJiri Pirko
This function is going to be used for templates as well, so we need to pass the pointer separately. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: sched: cls_flower: move key/mask dumping into a separate functionJiri Pirko
Push key/mask dumping from fl_dump() into a separate function fl_dump_key(), that will be reused for template dumping. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: sched: introduce chain templatesJiri Pirko
Allow user to set a template for newly created chains. Template lock down the chain for particular classifier type/options combinations. The classifier needs to support templates, otherwise kernel would reply with error. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: sched: introduce chain object to uapiJiri Pirko
Allow user to create, destroy, get and dump chain objects. Do that by extending rtnl commands by the chain-specific ones. User will now be able to explicitly create or destroy chains (so far this was done only automatically according the filter/act needs and refcounting). Also, the user will receive notification about any chain creation or destuction. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: sched: Avoid implicit chain 0 creationJiri Pirko
Currently, chain 0 is implicitly created during block creation. However that does not align with chain object exposure, creation and destruction api introduced later on. So make the chain 0 behave the same way as any other chain and only create it when it is needed. Since chain 0 is somehow special as the qdiscs need to hold pointer to the first chain tp, this requires to move the chain head change callback infra to the block structure. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: sched: push ops lookup bits into tcf_proto_lookup_ops()Jiri Pirko
Push all bits that take care of ops lookup, including module loading outside tcf_proto_create() function, into tcf_proto_lookup_ops() Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23Merge branch 'cpsw-add-MQPRIO-and-CBS-Qdisc-offload'David S. Miller
Ivan Khoronzhuk says: ==================== net: ethernet: ti: cpsw: add MQPRIO and CBS Qdisc offload This series adds MQPRIO and CBS Qdisc offload for TI cpsw driver. It potentially can be used in audio video bridging (AVB) and time sensitive networking (TSN). Patchset was tested on AM572x EVM and BBB boards. Last patch from this series adds detailed description of configuration with examples. For consistency reasons, in role of talker and listener, tools from patchset "TSN: Add qdisc based config interface for CBS" were used and can be seen here: https://www.spinics.net/lists/netdev/msg460869.html Based on net-next/master v5..v4: - corrected typo of "am57xx" board name, no functional changes v4..v3: - nothing, just rebase v3..v2: - corrected typo of "shaper" word, no functional changes v2..v1: - changed name cpsw.txt on ti-cpsw.txt - changed name cpsw_set_tc() on cpsw_set_mqprio() ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23Documentation: networking: cpsw: add MQPRIO & CBS offload examplesIvan Khoronzhuk
This document describes MQPRIO and CBS Qdisc offload configuration for cpsw driver based on examples. It potentially can be used in audio video bridging (AVB) and time sensitive networking (TSN). Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: ethernet: ti: cpsw: restore shaper configuration while down/upIvan Khoronzhuk
Need to restore shapers configuration after interface was down/up. This is needed as appropriate configuration is still replicated in kernel settings. This only shapers context restore, so vlan configuration should be restored by user if needed, especially for devices with one port where vlan frames are sent via ALE. Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: ethernet: ti: cpsw: add CBS Qdisc offloadIvan Khoronzhuk
The cpsw has up to 4 FIFOs per port and upper 3 FIFOs can feed rate limited queue with shaping. In order to set and enable shaping for those 3 FIFOs queues the network device with CBS qdisc attached is needed. The CBS configuration is added for dual-emac/single port mode only, but potentially can be used in switch mode also, based on switchdev for instance. Despite the FIFO shapers can work w/o cpdma level shapers the base usage must be in combine with cpdma level shapers as described in TRM, that are set as maximum rates for interface queues with sysfs. One of the possible configuration with txq shapers and CBS shapers: Configured with echo RATE > /sys/class/net/eth0/queues/tx-0/tx_maxrate /--------------------------------------------------- / / cpdma level shapers +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | c7 | | c6 | | c5 | | c4 | | c3 | | c2 | | c1 | | c0 | \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \/ \/ \/ \/ \/ \/ \/ \/ +---------|------|------|------|-------------------------------------+ | +----+ | | +---+ | | | +----+ | | | | v v v v | | +----+ +----+ +----+ +----+ p p+----+ +----+ +----+ +----+ | | | | | | | | | | o o| | | | | | | | | | | f3 | | f2 | | f1 | | f0 | r CPSW r| f3 | | f2 | | f1 | | f0 | | | | | | | | | | | t t| | | | | | | | | | \ / \ / \ / \ / 0 1\ / \ / \ / \ / | | \ X \ / \ / \ / \ / \ / \ / \ / | | \/ \ \/ \/ \/ \/ \/ \/ \/ | +-------\------------------------------------------------------------+ \ \ FIFO shaper, set with CBS offload added in this patch, \ FIFO0 cannot be rate limited ------------------------------------------------------ CBS shaper configuration is supposed to be used with root MQPRIO Qdisc offload allowing to add sk_prio->tc->txq maps that direct traffic to appropriate tx queue and maps L2 priority to FIFO shaper. The CBS shaper is intended to be used for AVB where L2 priority (pcp field) is used to differentiate class of traffic. So additionally vlan needs to be created with appropriate egress sk_prio->l2 prio map. If CBS has several tx queues assigned to it, the sum of their bandwidth has not overlap bandwidth set for CBS. It's recomended the CBS bandwidth to be a little bit more. The CBS shaper is configured with CBS qdisc offload interface using tc tool from iproute2 packet. For instance: $ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1 $ tc -g class show dev eth0 +---(100:ffe2) mqprio |    +---(100:3) mqprio |    +---(100:4) mqprio |     +---(100:ffe1) mqprio |    +---(100:2) mqprio |     +---(100:ffe0) mqprio     +---(100:1) mqprio $ tc qdisc add dev eth0 parent 100:1 cbs locredit -1440 \ hicredit 60 sendslope -960000 idleslope 40000 offload 1 $ tc qdisc add dev eth0 parent 100:2 cbs locredit -1470 \ hicredit 62 sendslope -980000 idleslope 20000 offload 1 The above code set CBS shapers for tc0 and tc1, for that txq0 and txq1 is used. Pay attention, the real set bandwidth can differ a bit due to discreteness of configuration parameters. Here parameters like locredit, hicredit and sendslope are ignored internally and are supposed to be set with assumption that maximum frame size for frame - 1500. It's supposed that interface speed is not changed while reconnection, not always is true, so inform user in case speed of interface was changed, as it can impact on dependent shapers configuration. For more examples see Documentation. Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: ethernet: ti: cpsw: add MQPRIO Qdisc offloadIvan Khoronzhuk
That's possible to offload vlan to tc priority mapping with assumption sk_prio == L2 prio. Example: $ ethtool -L eth0 rx 1 tx 4 $ qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1 $ tc -g class show dev eth0 +---(100:ffe2) mqprio |    +---(100:3) mqprio |    +---(100:4) mqprio |     +---(100:ffe1) mqprio |    +---(100:2) mqprio |     +---(100:ffe0) mqprio     +---(100:1) mqprio Here, 100:1 is txq0, 100:2 is txq1, 100:3 is txq2, 100:4 is txq3 txq0 belongs to tc0, txq1 to tc1, txq2 and txq3 to tc2 The offload part only maps L2 prio to classes of traffic, but not to transmit queues, so to direct traffic to traffic class vlan has to be created with appropriate egress map. Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: ethernet: ti: cpdma: fit rated channels in backward orderIvan Khoronzhuk
According to TRM tx rated channels should be in 7..0 order, so correct it. Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23net: ethernet: ti: cpsw: use cpdma channels in backward order for txqIvan Khoronzhuk
The cpdma channel highest priority is from hi to lo number. The driver has limited number of descriptors that are shared between number of cpdma channels. Number of queues can be tuned with ethtool, that allows to not spend descriptors on not needed cpdma channels. In AVB usually only 2 tx queues can be enough with rate limitation. The rate limitation can be used only for hi priority queues. Thus, to use only 2 queues the 8 has to be created. It's wasteful. So, in order to allow using only needed number of rate limited tx queues, save resources, and be able to set rate limitation for them, let assign tx cpdma channels in backward order to queues. Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-23Merge tag 'mlx5e-updates-2018-07-18-v2' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5e-updates-2018-07-18 This series includes update for mlx5e net device driver. 1) From Feras Daoud, Added the support for firmware log tracing, first by introducing the firmware API needed for the task and then For each PF do the following: 1- Allocate memory for the tracer strings database and read it from the FW to the SW. 2- Allocate and dma map tracer buffers. Traces that will be written into the buffer will be parsed as a group of one or more traces, referred to as trace message. The trace message represents a C-like printf string. Once a new trace is available FW will generate an event indicates new trace/s are available and the driver will parse them and dump them using tracepoints event tracing Enable mlx5 fw tracing by: echo 1 > /sys/kernel/debug/tracing/events/mlx5/mlx5_fw/enable Read traces by: cat /sys/kernel/debug/tracing/trace 2) From Roi Dayan, Remove redundant WARN when we cannot find neigh entry 3) From Jianbo Liu, TC double vlan support - Support offloading tc double vlan headers match - Support offloading double vlan push/pop tc actions 4) From Boris, re-visit UDP GSO, remove the splitting of UDP_GSO_L4 packets in the driver, and exposes UDP_GSO_L4 as a PARTIAL_GSO feature. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-24bpf: btf: Ensure the member->offset is in the right orderMartin KaFai Lau
This patch ensures the member->offset of a struct is in the correct order (i.e the later member's offset cannot go backward). The current "pahole -J" BTF encoder does not generate something like this. However, checking this can ensure future encoder will not violate this. Fixes: 69b693f0aefa ("bpf: btf: Introduce BPF Type Format (BTF)") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-24netfilter: nf_tables: move dumper state allocation into ->startFlorian Westphal
Shaochun Chen points out we leak dumper filter state allocations stored in dump_control->data in case there is an error before netlink sets cb_running (after which ->done will be called at some point). In order to fix this, add .start functions and do the allocations there. ->done is going to clean up, and in case error occurs before ->start invocation no cleanups need to be done anymore. Reported-by: shaochun chen <cscnull@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-23net/mlx5e: Use PARTIAL_GSO for UDP segmentationBoris Pismenny
This patch removes the splitting of UDP_GSO_L4 packets in the driver, and exposes UDP_GSO_L4 as a PARTIAL_GSO feature. Thus, the network stack is not responsible for splitting the packet into two. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-23net/mlx5e: Support offloading double vlan push/pop tc actionsJianbo Liu
As we can configure two push/pop actions in one flow table entry, add support to offload those double vlan actions in a rule to HW. Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-23net/mlx5e: Refactor tc vlan push/pop actions offloadingJianbo Liu
Extract actions offloading code to a new function, and also extend data structures for double vlan actions. Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-23net/mlx5e: Support offloading tc double vlan headers matchJianbo Liu
We can match on both outer and inner vlan tags, add support for offloading that. Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-23net/mlx5e: Remove redundant WARN when we cannot find neigh entryRoi Dayan
It is possible for neigh entry not to exist if it was cleaned already. When we bring down an interface the neigh gets deleted but it could be that our listener for neigh event to clear the encap valid bit didn't start yet and the neigh update last used work is started first. In this scenario the encap entry has valid bit set but the neigh entry doesn't exist. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-23net/mlx5: FW tracer, Add debug printsSaeed Mahameed
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>