Age | Commit message | Author
2019-08-31Merge branch 'i2c/for-current' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "I2C has a bunch of driver fixes and a core improvement to make the on-going API transition more robust" * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: mediatek: disable zero-length transfers for mt8183 i2c: iproc: Stop advertising support of SMBUS quick cmd MAINTAINERS: i2c mv64xxx: Update documentation path i2c: piix4: Fix port selection for AMD Family 16h Model 30h i2c: designware: Synchronize IRQs when unregistering slave client i2c: i801: Avoid memory leak in check_acpi_smo88xx_device() i2c: make i2c_unregister_device() ERR_PTR safe
2019-08-31Merge tag 'trace-v5.3-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Small fixes and minor cleanups for tracing: - Make exported ftrace function not static - Fix NULL pointer dereference in reading probes as they are created - Fix NULL pointer dereference in k/uprobe clean up path - Various documentation fixes" * tag 'trace-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Correct kdoc formats ftrace/x86: Remove mcount() declaration tracing/probe: Fix null pointer dereference tracing: Make exported ftrace_set_clr_event non-static ftrace: Check for successful allocation of hash ftrace: Check for empty hash and comment the race with registering probes ftrace: Fix NULL pointer dereference in t_probe_next()
2019-08-31Merge tag 'riscv/for-v5.3-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fix from Paul Walmsley: "One significant fix for 32-bit RISC-V systems: Fix the RV32 memory map to prevent userspace from corrupting the FIXMAP area. Without this patch, the system can crash very early during the boot" * tag 'riscv/for-v5.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: RISC-V: Fix FIXMAP area corruption on RV32 systems
2019-08-31Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Radim Krčmář: "PPC: - Fix bug which could leave locks held in the host on return to a guest. x86: - Prevent infinitely looping emulation of a failing syscall while single stepping. - Do not crash the host when nesting is disabled" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: Don't update RIP or do single-step on faulting emulation KVM: x86: hyper-v: don't crash on KVM_GET_SUPPORTED_HV_CPUID when kvm_intel.nested is disabled KVM: PPC: Book3S: Fix incorrect guest-to-user-translation error handling
2019-08-31Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc mm fixes from Andrew Morton: "7 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: memcontrol: fix percpu vmstats and vmevents flush mm, memcg: do not set reclaim_state on soft limit reclaim mailmap: add aliases for Dmitry Safonov mm/z3fold.c: fix lock/unlock imbalance in z3fold_page_isolate mm, memcg: partially revert "mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones" mm/zsmalloc.c: fix build when CONFIG_COMPACTION=n mm: memcontrol: flush percpu slab vmstats on kmem offlining
2019-08-31tracing: Correct kdoc formatsJakub Kicinski
Fix the following kdoc warnings: kernel/trace/trace.c:1579: warning: Function parameter or member 'tr' not described in 'update_max_tr_single' kernel/trace/trace.c:1579: warning: Function parameter or member 'tsk' not described in 'update_max_tr_single' kernel/trace/trace.c:1579: warning: Function parameter or member 'cpu' not described in 'update_max_tr_single' kernel/trace/trace.c:1776: warning: Function parameter or member 'type' not described in 'register_tracer' kernel/trace/trace.c:2239: warning: Function parameter or member 'task' not described in 'tracing_record_taskinfo' kernel/trace/trace.c:2239: warning: Function parameter or member 'flags' not described in 'tracing_record_taskinfo' kernel/trace/trace.c:2269: warning: Function parameter or member 'prev' not described in 'tracing_record_taskinfo_sched_switch' kernel/trace/trace.c:2269: warning: Function parameter or member 'next' not described in 'tracing_record_taskinfo_sched_switch' kernel/trace/trace.c:2269: warning: Function parameter or member 'flags' not described in 'tracing_record_taskinfo_sched_switch' kernel/trace/trace.c:3078: warning: Function parameter or member 'ip' not described in 'trace_vbprintk' kernel/trace/trace.c:3078: warning: Function parameter or member 'fmt' not described in 'trace_vbprintk' kernel/trace/trace.c:3078: warning: Function parameter or member 'args' not described in 'trace_vbprintk' Link: http://lkml.kernel.org/r/20190828052549.2472-2-jakub.kicinski@netronome.com Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31ftrace/x86: Remove mcount() declarationJisheng Zhang
Commit 562e14f72292 ("ftrace/x86: Remove mcount support") removed the support for using mcount, so we could remove the mcount() declaration to clean up. Link: http://lkml.kernel.org/r/20190826170150.10f101ba@xhacker.debian Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31tracing/probe: Fix null pointer dereferenceXinpeng Liu
BUG: KASAN: null-ptr-deref in trace_probe_cleanup+0x8d/0xd0 Read of size 8 at addr 0000000000000000 by task syz-executor.0/9746 trace_probe_cleanup+0x8d/0xd0 free_trace_kprobe.part.14+0x15/0x50 alloc_trace_kprobe+0x23e/0x250 Link: http://lkml.kernel.org/r/1565220563-980-1-git-send-email-danielliu861@gmail.com Fixes: e3dc9f898ef9c ("tracing/probe: Add trace_event_call accesses APIs") Signed-off-by: Xinpeng Liu <danielliu861@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31tracing: Make exported ftrace_set_clr_event non-staticDenis Efremov
The function ftrace_set_clr_event is declared static and marked EXPORT_SYMBOL_GPL(), which is at best an odd combination. Because it was decided that the function should be part of the API, this commit removes the static attribute and adds the declaration to the header. Link: http://lkml.kernel.org/r/20190704172110.27041-1-efremov@linux.com Fixes: f45d1225adb04 ("tracing: Kernel access to Ftrace instances") Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Denis Efremov <efremov@linux.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-30udp: Remove unlikely() from IS_ERR*() conditionDenis Efremov
"unlikely(IS_ERR_OR_NULL(x))" is excessive. IS_ERR_OR_NULL() already uses unlikely() internally. Signed-off-by: Denis Efremov <efremov@linux.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Joe Perches <joe@perches.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
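The redundancy described above can be sketched in userspace C. The macros below are simplified stand-ins written for this illustration (they assume a GCC/Clang-style `__builtin_expect`), and `check_redundant()`/`check_plain()` are hypothetical names; this is not a copy of the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel macros (assumption). */
#define MAX_ERRNO 4095
#define unlikely(x) __builtin_expect(!!(x), 0)
#define IS_ERR_VALUE(x) \
	unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)

/* The branch hint is already applied inside the helper. */
static inline bool is_err_or_null(const void *ptr)
{
	return unlikely(!ptr) || IS_ERR_VALUE((unsigned long)ptr);
}

/* Redundant form: a second hint wrapped around an already-hinted helper. */
static inline bool check_redundant(const void *ptr)
{
	return unlikely(is_err_or_null(ptr));
}

/* Form after the cleanup: rely on the hint inside the helper. */
static inline bool check_plain(const void *ptr)
{
	return is_err_or_null(ptr);
}

/* Small usage example: both forms always agree on the result. */
void demo_unlikely(void)
{
	int value = 0;

	assert(check_plain(NULL) && check_redundant(NULL));
	assert(!check_plain(&value) && !check_redundant(&value));
}
```

Both forms return the same result for every pointer; the outer `unlikely()` only repeats a hint the compiler already has.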
2019-08-30net/mlx5e: Remove unlikely() from WARN*() conditionDenis Efremov
"unlikely(WARN_ON_ONCE(x))" is excessive. WARN_ON_ONCE() already uses unlikely() internally. Signed-off-by: Denis Efremov <efremov@linux.com> Cc: Boris Pismenny <borisp@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Joe Perches <joe@perches.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: netdev@vger.kernel.org Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fix from Herbert Xu: "Fix a potential crash in the ccp driver" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: ccp - Ignore unconfigured CCP device on suspend/resume
2019-08-30Partially revert "kfifo: fix kfifo_alloc() and kfifo_init()"Linus Torvalds
Commit dfe2a77fd243 ("kfifo: fix kfifo_alloc() and kfifo_init()") made the kfifo code round the number of elements up. That was good for __kfifo_alloc(), but it's actually wrong for __kfifo_init(). The difference? __kfifo_alloc() will allocate the rounded-up number of elements, but __kfifo_init() uses an allocation done by the caller. We can't just say "use more elements than the caller allocated", and have to round down. The good news? All the normal cases will be using power-of-two arrays anyway, and most users of kfifo's don't use kfifo_init() at all, but one of the helper macros to declare a KFIFO that enforce the proper power-of-two behavior. But it looks like at least ibmvscsis might be affected. The bad news? Will Deacon refers to an old thread and points out that the memory ordering in kfifo's is questionable. See https://lore.kernel.org/lkml/20181211034032.32338-1-yuleixzhang@tencent.com/ for more. Fixes: dfe2a77fd243 ("kfifo: fix kfifo_alloc() and kfifo_init()") Reported-by: laokz <laokz@foxmail.com> Cc: Stefani Seibold <stefani@seibold.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Greg KH <greg@kroah.com> Cc: Kees Cook <keescook@chromium.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
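The round-up versus round-down distinction can be sketched as follows. The helper names are hypothetical and written for this illustration; they mirror the idea of the kernel's power-of-two rounding helpers but are not the kfifo code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of two >= n (illustrative helper, small n >= 1 only). */
static inline size_t round_up_pow2(size_t n)
{
	size_t p = 1;

	assert(n >= 1);
	while (p < n)
		p <<= 1;
	return p;
}

/* Largest power of two <= n (illustrative helper, n >= 1 only). */
static inline size_t round_down_pow2(size_t n)
{
	size_t p = 1;

	assert(n >= 1);
	while (p <= n / 2)
		p <<= 1;
	return p;
}

/*
 * FIFO that allocates its own storage: rounding up is safe because the
 * allocation is sized to match the rounded value.
 */
static inline size_t fifo_alloc_capacity(size_t requested)
{
	return round_up_pow2(requested);
}

/*
 * FIFO built on a caller-supplied buffer: the capacity must never exceed
 * what the caller allocated, so round down.
 */
static inline size_t fifo_init_capacity(size_t caller_elems)
{
	return round_down_pow2(caller_elems);
}
```

With a caller buffer of 6 elements, rounding up would claim 8 slots and write past the end of the buffer; rounding down uses 4 and stays in bounds.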
2019-08-30mm: memcontrol: fix percpu vmstats and vmevents flushShakeel Butt
Instead of using raw_cpu_read() use per_cpu() to read the actual data of the corresponding cpu otherwise we will be reading the data of the current cpu for the number of online CPUs. Link: http://lkml.kernel.org/r/20190829203110.129263-1-shakeelb@google.com Fixes: bb65f89b7d3d ("mm: memcontrol: flush percpu vmevents before releasing memcg") Fixes: c350a99ea2b1 ("mm: memcontrol: flush percpu vmstats before releasing memcg") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
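A toy userspace model of the bug, assuming four CPUs and a plain array in place of real per-CPU storage. `model_raw_cpu_read()` and `model_per_cpu()` are hypothetical stand-ins for the kernel accessors named in the commit message, not their real definitions.

```c
#include <assert.h>

#define MODEL_NR_CPUS 4

/* Toy per-CPU counter: one slot per CPU (assumption for illustration). */
struct percpu_model {
	long slot[MODEL_NR_CPUS];
};

/* Stand-in for raw_cpu_read(): always reads the current CPU's slot. */
static inline long model_raw_cpu_read(const struct percpu_model *c,
				      int this_cpu)
{
	return c->slot[this_cpu];
}

/* Stand-in for per_cpu(): reads the slot of the CPU the caller names. */
static inline long model_per_cpu(const struct percpu_model *c, int cpu)
{
	return c->slot[cpu];
}

/* Buggy flush: the loop variable is ignored, so one slot is summed N times. */
static inline long flush_buggy(const struct percpu_model *c, int this_cpu)
{
	long sum = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += model_raw_cpu_read(c, this_cpu);
	return sum;
}

/* Fixed flush: each CPU's own slot is read exactly once. */
static inline long flush_fixed(const struct percpu_model *c)
{
	long sum = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += model_per_cpu(c, cpu);
	return sum;
}
```

With slots {1, 2, 3, 4} and the flush running on CPU 0, the buggy version yields 4 (1 counted four times) while the fixed version yields 10.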
2019-08-30mm, memcg: do not set reclaim_state on soft limit reclaimMichal Hocko
Adric Blake has noticed[1] the following warning: WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40 [...] Call Trace: mem_cgroup_shrink_node+0x9b/0x1d0 mem_cgroup_soft_limit_reclaim+0x10c/0x3a0 balance_pgdat+0x276/0x540 kswapd+0x200/0x3f0 ? wait_woken+0x80/0x80 kthread+0xfd/0x130 ? balance_pgdat+0x540/0x540 ? kthread_park+0x80/0x80 ret_from_fork+0x35/0x40 ---[ end trace 727343df67b2398a ]--- which tells us that soft limit reclaim is about to overwrite the reclaim_state configured up in the call chain (kswapd in this case but the direct reclaim is equally possible). This means that reclaim stats would get misleading once the soft reclaim returns and another reclaim is done. Fix the warning by dropping set_task_reclaim_state from the soft reclaim which is always called with reclaim_state set up. [1] http://lkml.kernel.org/r/CAE1jjeePxYPvw1mw2B3v803xHVR_BNnz0hQUY_JDMN8ny29M6w@mail.gmail.com Link: http://lkml.kernel.org/r/20190828071808.20410-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Adric Blake <promarbler14@gmail.com> Acked-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30mailmap: add aliases for Dmitry SafonovDmitry Safonov
I don't work for Virtuozzo or Samsung anymore and I've noticed that they have started sending annoying html email-replies. And I prioritize my personal emails over work email box, so while at it add an entry for Arista too - so I can reply faster when needed. Link: http://lkml.kernel.org/r/20190827220346.11123-1-dima@arista.com Signed-off-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30mm/z3fold.c: fix lock/unlock imbalance in z3fold_page_isolateGustavo A. R. Silva
Fix lock/unlock imbalance by unlocking *zhdr* before return. Addresses Coverity ID 1452811 ("Missing unlock") Link: http://lkml.kernel.org/r/20190826030634.GA4379@embeddedor Fixes: d776aaa9895e ("mm/z3fold.c: fix race between migration and destruction") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30mm, memcg: partially revert "mm/memcontrol.c: keep local VM counters in sync ↵Roman Gushchin
with the hierarchical ones" Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones") effectively decreased the precision of per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters. That's good for displaying in memory.stat, but brings a serious regression into the reclaim process. One issue I've discovered and debugged is the following: lruvec_lru_size() can return 0 instead of the actual number of pages in the lru list, preventing the kernel from reclaiming the last remaining pages. The result is yet another flood of dying memory cgroups. The opposite is also happening: scanning an empty lru list is a waste of CPU time. Also, inactive_list_is_low() can return incorrect values, preventing the active lru from being scanned and freed. It can fail both because the sizes of the active and inactive lists are inaccurate, and because the number of workingset refaults isn't precise. In other words, the result is pretty random. I'm not sure if using the approximate number of slab pages in count_shadow_number() is acceptable, but the issues described above are enough to partially revert the patch. Let's keep per-memcg vmstat_local batched (they are only used for displaying stats to the userspace), but keep lruvec stats precise. This change fixes the dead memcg flooding on my setup. Link: http://lkml.kernel.org/r/20190817004726.2530670-1-guro@fb.com Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones") Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Yafang Shao <laoar.shao@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30mm/zsmalloc.c: fix build when CONFIG_COMPACTION=nAndrew Morton
Fixes: 701d678599d0c1 ("mm/zsmalloc.c: fix race condition in zs_destroy_pool") Link: http://lkml.kernel.org/r/201908251039.5oSbEEUT%25lkp@intel.com Reported-by: kbuild test robot <lkp@intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Jonathan Adams <jwadams@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30mm: memcontrol: flush percpu slab vmstats on kmem offliningRoman Gushchin
I've noticed that the "slab" value in memory.stat is sometimes 0, even if some children memory cgroups have a non-zero "slab" value. The following investigation showed that this is the result of the kmem_cache reparenting in combination with the per-cpu batching of slab vmstats. At offlining, some vmstat values may be left in the percpu cache without being propagated up the cgroup hierarchy. It means that stats on ancestor levels are lower than the actual values. Later, when slab pages are released, the precise number of pages is subtracted on the parent level, making the value negative. We don't show negative values, 0 is printed instead. To fix this issue, let's flush percpu slab memcg and lruvec stats on memcg offlining. This guarantees that numbers on all ancestor levels are accurate and match the actual number of outstanding slab pages. Link: http://lkml.kernel.org/r/20190819202338.363363-3-guro@fb.com Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal") Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
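The failure mode can be modelled in a few lines. This is a deliberately simplified sketch: `struct memcg_model` and all function names are hypothetical, a single `pending` field stands in for the per-CPU batch, and the real memcg accounting is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Toy cgroup: a hierarchical stat plus a batched, not-yet-propagated delta. */
struct memcg_model {
	long stat;                  /* value visible in the hierarchy */
	long pending;               /* batched delta still sitting in the cache */
	struct memcg_model *parent;
};

/* Charge pages into the batch only; nothing reaches the ancestors yet. */
static inline void model_charge_batched(struct memcg_model *cg, long pages)
{
	cg->pending += pages;
}

/* Push the batched delta into this cgroup and every ancestor. */
static inline void model_flush(struct memcg_model *cg)
{
	for (struct memcg_model *p = cg; p; p = p->parent)
		p->stat += cg->pending;
	cg->pending = 0;
}

/* Offline the cgroup; without a flush the batched delta is left behind. */
static inline void model_offline(struct memcg_model *cg, int flush)
{
	if (flush)
		model_flush(cg);
}

/* Negative values are not shown to userspace; 0 is printed instead. */
static inline long model_display(const struct memcg_model *cg)
{
	return cg->stat < 0 ? 0 : cg->stat;
}
```

In this model, charging 5 pages to a child, offlining it without a flush, and then releasing 3 pages at the parent level leaves the parent at -3, which is displayed as 0. With the flush, the parent shows 2.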
2019-08-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Spurious warning when loading rules using the physdev match, from Todd Seidelmann. 2) Fix FTP conntrack helper debugging output, from Thomas Jarosch. 3) Restore per-netns nf_conntrack_{acct,helper,timeout} sysctl knobs, from Florian Westphal. 4) Clear skbuff timestamp from the flowtable datapath, also from Florian. 5) Fix incorrect byteorder of NFT_META_BRI_IIFVPROTO, from wenxu. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-08-31 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix 32-bit zero-extension during constant blinding which has been causing a regression on ppc64, from Naveen. 2) Fix a latency bug in nfp driver when updating stack index register, from Jiong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30bnxt_en: Fix compile error regression with CONFIG_BNXT_SRIOV not set.Michael Chan
Add a new function bnxt_get_registered_vfs() to handle the work of getting the number of registered VFs under #ifdef CONFIG_BNXT_SRIOV. The main code will call this function and will always work correctly whether CONFIG_BNXT_SRIOV is set or not. Fixes: 230d1f0de754 ("bnxt_en: Handle firmware reset.") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-31Merge branch 'bpf-xdp-unaligned-chunk'Daniel Borkmann
Kevin Laatz says: ==================== This patch set adds the ability to use unaligned chunks in the XDP umem. Currently, all chunk addresses passed to the umem are masked to be chunk size aligned (max is PAGE_SIZE). This limits where we can place chunks within the umem as well as limiting the packet sizes that are supported. The changes in this patch set remove these restrictions, allowing XDP to be more flexible in where it can place a chunk within a umem. By relaxing where the chunks can be placed, it allows us to use an arbitrary buffer size and place that wherever we have a free address in the umem. These changes add the ability to support arbitrary frame sizes up to 4k (PAGE_SIZE) and make it easy to integrate with other existing frameworks that have their own memory management systems, such as DPDK. In DPDK, for example, there is already support for AF_XDP with zero-copy. However, with this patch set the integration will be much more seamless. You can find the DPDK AF_XDP driver at: https://git.dpdk.org/dpdk/tree/drivers/net/af_xdp Since we are now dealing with arbitrary frame sizes, we also need to update how we pass around addresses. Currently, the addresses can simply be masked to 2k to get back to the original address. This becomes less trivial when using frame sizes that are not a 'power of 2' size. This patch set modifies the Rx/Tx descriptor format to use the upper 16-bits of the addr field for an offset value, leaving the lower 48-bits for the address (this leaves us with 256 Terabytes, which should be enough!). We only need to use the upper 16-bits to store the offset when running in unaligned mode. Rather than adding the offset (headroom etc) to the address, we will store it in the upper 16-bits of the address field.
This way, we can easily add the offset to the address where we need it, using some bit manipulation and addition, and we can also easily get the original address wherever we need it (for example in i40e_zca_free) by simply masking to get the lower 48-bits of the address field. The patch set was tested with the following set up: - Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz - Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02) - Driver: i40e - Application: xdpsock with l2fwd (single interface) - Turbo disabled in BIOS There are no changes to performance before and after these patches for SKB mode and Copy mode. Zero-copy mode saw a performance degradation of ~1.5%. This patch set has been applied against commit 0bb52b0dfc88 ("tools: bpftool: add 'bpftool map freeze' subcommand") Structure of the patch set: Patch 1: - Remove unnecessary masking and headroom addition during zero-copy Rx buffer recycling in i40e. This change is required in order for the buffer recycling to work in the unaligned chunk mode. Patch 2: - Remove unnecessary masking and headroom addition during zero-copy Rx buffer recycling in ixgbe. This change is required in order for the buffer recycling to work in the unaligned chunk mode. Patch 3: - Add infrastructure for unaligned chunks. Since we are dealing with unaligned chunks that could potentially cross a physical page boundary, we add checks to keep track of that information. We can later use this information to correctly handle buffers that are placed at an address where they cross a page boundary. This patch also modifies the existing Rx and Tx functions to use the new descriptor format. To handle addresses correctly, we need to mask appropriately based on whether we are in aligned or unaligned mode. Patch 4: - This patch updates the i40e driver to make use of the new descriptor format. Patch 5: - This patch updates the ixgbe driver to make use of the new descriptor format. 
Patch 6: - This patch updates the mlx5e driver to make use of the new descriptor format. These changes are required to handle the new descriptor format and for unaligned chunks support. Patch 7: - This patch allows XSK frames smaller than page size in the mlx5e driver. Relax the requirements to the XSK frame size to allow it to be smaller than a page and even not a power of two. The current implementation can work in this mode, both with Striding RQ and without it. Patch 8: - Add flags for umem configuration to libbpf. Since we increase the size of the struct by adding flags, we also need to add the ABI versioning in this patch. Patch 9: - Modify xdpsock application to add a command line option for unaligned chunks Patch 10: - Since we can now run the application in unaligned chunk mode, we need to make sure we recycle the buffers appropriately. Patch 11: - Adds hugepage support to the xdpsock application Patch 12: - Documentation update to include the unaligned chunk scenario. We need to explicitly state that the incoming addresses are only masked in the aligned chunk mode and not the unaligned chunk mode. v2: - fixed checkpatch issues - fixed Rx buffer recycling for unaligned chunks in xdpsock - removed unused defines - fixed how chunk_size is calculated in xsk_diag.c - added some performance numbers to cover letter - modified descriptor format to make it easier to retrieve original address - removed patch adding off_t off to the zero copy allocator. This is no longer needed with the new descriptor format. v3: - added patch for mlx5 driver changes needed for unaligned chunks - moved offset handling to new helper function - changed value used for the umem chunk_mask. Now using the new descriptor format to save us doing the calculations in a number of places meaning more of the code is left unchanged while adding unaligned chunk support. v4: - reworked the next_pg_contig field in the xdp_umem_page struct. 
We now use the low 12 bits of the addr for flags rather than adding an extra field in the struct. - modified unaligned chunks flag define - fixed page_start calculation in __xsk_rcv_memcpy(). - move offset handling to the xdp_umem_get_* functions - modified the len field in xdp_umem_reg struct. We now use 16 bits from this for the flags field. - fixed headroom addition to handle in the mlx5e driver - other minor changes based on review comments v5: - Added ABI versioning in the libbpf patch - Removed bitfields in the xdp_umem_reg struct. Adding new flags field. - Added accessors for getting addr and offset. - Added helper function for adding the offset to the addr. - Fixed conflicts with 'bpf-af-xdp-wakeup' which was merged recently. - Fixed typo in mlx driver patch. - Moved libbpf patch to later in the set (7/11, just before the sample app changes) v6: - Added support for XSK frames smaller than page in mlx5e driver (Maxim Mikityanskiy <maximmi@mellanox.com>). - Fixed offset handling in xsk_generic_rcv. - Added check for base address in xskq_is_valid_addr_unaligned. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31doc/af_xdp: include unaligned chunk caseKevin Laatz
With the addition of unaligned chunks mode, the documentation needs to be updated to indicate that the incoming addr to the fill ring will only be masked if the user application is run in the aligned chunk mode. This patch also adds a line to explicitly indicate that the incoming addr will not be masked if running the user application in the unaligned chunk mode. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31samples/bpf: use hugepages in xdpsock appKevin Laatz
This patch modifies xdpsock to use mmap instead of posix_memalign. With this change, we can use hugepages when running the application in unaligned chunks mode. Using hugepages makes it more likely that we have physically contiguous memory, which supports the unaligned chunk mode better. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31samples/bpf: add buffer recycling for unaligned chunks to xdpsockKevin Laatz
This patch adds buffer recycling support for unaligned buffers. Since we don't mask the addr to 2k at umem_reg in unaligned mode, we need to make sure we give back the correct (original) addr to the fill queue. We achieve this using the new descriptor format and associated masks. The new format uses the upper 16-bits for the offset and the lower 48-bits for the addr. Since we have a field for the offset, we no longer need to modify the actual address. As such, all we have to do to get back the original address is mask for the lower 48 bits (i.e. strip the offset and we get the address on its own). Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
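The descriptor layout described here (offset in the upper 16 bits, address in the lower 48 bits) can be sketched with plain bit operations. The macro and function names below are hypothetical and chosen for this illustration; the actual kernel and libbpf definitions may be named and organised differently.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 63..48 hold the offset, bits 47..0 the address. */
#define DESC_OFFSET_SHIFT 48
#define DESC_ADDR_MASK    ((UINT64_C(1) << DESC_OFFSET_SHIFT) - 1)

/* Pack a base address and an offset (e.g. headroom) into one 64-bit field. */
static inline uint64_t desc_pack(uint64_t base, uint16_t offset)
{
	assert((base & ~DESC_ADDR_MASK) == 0); /* base must fit in 48 bits */
	return base | ((uint64_t)offset << DESC_OFFSET_SHIFT);
}

/* Recover the original address: strip the upper 16 bits. */
static inline uint64_t desc_extract_addr(uint64_t desc)
{
	return desc & DESC_ADDR_MASK;
}

/* Recover the offset stored in the upper 16 bits. */
static inline uint64_t desc_extract_offset(uint64_t desc)
{
	return desc >> DESC_OFFSET_SHIFT;
}

/* Address of the data: original address plus the stored offset. */
static inline uint64_t desc_add_offset_to_addr(uint64_t desc)
{
	return desc_extract_addr(desc) + desc_extract_offset(desc);
}
```

Because the offset lives in its own bit field, recycling a buffer only needs `desc_extract_addr()` to hand the original address back to the fill queue; no arithmetic has to be undone.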
2019-08-31samples/bpf: add unaligned chunks mode support to xdpsockKevin Laatz
This patch adds support for the unaligned chunks mode. The addition of the unaligned chunks option will allow users to run the application with more relaxed chunk placement in the XDP umem. Unaligned chunks mode can be used with the '-u' or '--unaligned' command line options. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31libbpf: add flags to umem configKevin Laatz
This patch adds a 'flags' field to the umem_config and umem_reg structs. This will allow for more options to be added for configuring umems. The first use for the flags field is to add a flag for unaligned chunks mode. These flags can either be user-provided or filled with a default. Since we change the size of the xsk_umem_config struct, we need to version the ABI. This patch includes the ABI versioning for xsk_umem__create. The Makefile was also updated to handle multiple function versions in check-abi. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31net/mlx5e: Allow XSK frames smaller than a pageMaxim Mikityanskiy
Relax the requirements to the XSK frame size to allow it to be smaller than a page and even not a power of two. The current implementation can work in this mode, both with Striding RQ and without it. The code that checks `mtu + headroom <= XSK frame size` is modified accordingly. Any frame size between 2048 and PAGE_SIZE is accepted. Functions that worked with pages only now work with XSK frames, even if their size is different from PAGE_SIZE. With XSK queues, regardless of the frame size, Striding RQ uses the stride size of PAGE_SIZE, and UMR MTTs are posted using starting addresses of frames, but PAGE_SIZE as page size. MTU guarantees that no packet data will overlap with other frames. UMR MTT size is made equal to the stride size of the RQ, because UMEM frames may come in random order, and we need to handle them one by one. PAGE_SIZE is just a power of two that is bigger than any allowed XSK frame size, and also it doesn't require making additional changes to the code. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
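The relaxed size rule can be sketched as two small predicates. This assumes a 4096-byte page and treats the 2048..PAGE_SIZE range as inclusive; the names are hypothetical, and the real driver check involves additional hardware-specific terms not shown here.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096u /* assumption: 4 KiB pages */
#define MODEL_MIN_FRAME 2048u /* lower bound quoted in the commit message */

/* Any size in the range is acceptable, power of two or not. */
static inline bool model_frame_size_ok(unsigned int frame_size)
{
	return frame_size >= MODEL_MIN_FRAME && frame_size <= MODEL_PAGE_SIZE;
}

/* The packet plus headroom must fit inside a single frame. */
static inline bool model_mtu_fits(unsigned int mtu, unsigned int headroom,
				  unsigned int frame_size)
{
	return model_frame_size_ok(frame_size) &&
	       mtu + headroom <= frame_size;
}
```

Under these assumptions a 3000-byte frame is accepted even though it is not a power of two, while a 1024-byte frame is rejected.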
2019-08-31mlx5e: modify driver for handling offsetsKevin Laatz
With the addition of the unaligned chunks option, we need to make sure we handle the offsets accordingly based on the mode we are currently running in. This patch modifies the driver to appropriately mask the address for each case. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31ixgbe: modify driver for handling offsetsKevin Laatz
With the addition of the unaligned chunks option, we need to make sure we handle the offsets accordingly based on the mode we are currently running in. This patch modifies the driver to appropriately mask the address for each case. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31i40e: modify driver for handling offsetsKevin Laatz
With the addition of the unaligned chunks option, we need to make sure we handle the offsets accordingly based on the mode we are currently running in. This patch modifies the driver to appropriately mask the address for each case. Signed-off-by: Bruce Richardson <bruce.richardson@intel.com> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31xsk: add support to allow unaligned chunk placementKevin Laatz
Currently, addresses are chunk size aligned. This means we are very restricted in terms of where we can place chunks within the umem. For example, if we have a chunk size of 2k, then our chunks can only be placed at 0,2k,4k,6k,8k... and so on (i.e. every 2k starting from 0). This patch introduces the ability to use unaligned chunks. With these changes, we are no longer bound to having to place chunks at a 2k (or whatever your chunk size is) interval. Since we are no longer dealing with aligned chunks, they can now cross page boundaries. Checks for page contiguity have been added in order to keep track of which pages are followed by a physically contiguous page. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
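A small sketch of the two ideas in this message: masking an address down to a chunk boundary in aligned mode, and detecting when an unaligned chunk spans two pages. Both helpers are hypothetical names for illustration and assume a power-of-two chunk size for the masking case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Aligned mode: mask the address down to the start of its chunk.
 * Only valid when chunk_size is a power of two. */
static inline uint64_t chunk_align_down(uint64_t addr, uint64_t chunk_size)
{
	assert(chunk_size != 0 && (chunk_size & (chunk_size - 1)) == 0);
	return addr & ~(chunk_size - 1);
}

/* Unaligned mode: does the buffer [addr, addr + len) span two pages? */
static inline bool chunk_crosses_page(uint64_t addr, uint64_t len,
				      uint64_t page_size)
{
	assert(len > 0 && page_size > 0);
	return (addr / page_size) != ((addr + len - 1) / page_size);
}
```

With 2k chunks and 4k pages, a chunk at address 2048 stays inside one page, while an unaligned chunk at address 3000 runs from 3000 to 5047 and crosses into the next page. That second case is why the contiguity of the following page has to be tracked.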
2019-08-31ixgbe: simplify Rx buffer recycleKevin Laatz
Currently, the dma, addr and handle are modified when we reuse Rx buffers in zero-copy mode. However, this is not required as the inputs to the function are copies, not the original values themselves. As we use the copies within the function, we can use the original 'obi' values directly without having to mask and add the headroom. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31i40e: simplify Rx buffer recycleKevin Laatz
Currently, the dma, addr and handle are modified when we reuse Rx buffers in zero-copy mode. However, this is not required as the inputs to the function are copies, not the original values themselves. As we use the copies within the function, we can use the original 'old_bi' values directly without having to mask and add the headroom. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31selftests/bpf: Fix a typo in test_offload.pyMasanari Iida
This patch fixes a spelling typo in test_offload.py. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31bpf: fix error check in bpf_tcp_gen_syncookiePetar Penkov
If a SYN cookie is not issued by tcp_v#_gen_syncookie, then the return value will be exactly 0, rather than <= 0. Let's change the check to reflect that, especially since mss is an unsigned value and cannot be negative. Fixes: 70d66244317e ("bpf: add bpf_tcp_gen_syncookie helper") Reported-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Petar Penkov <ppenkov@google.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
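The reasoning about the unsigned return value can be shown with a short sketch. It models a C unsigned 32-bit variable using Python's `ctypes`. The helper names are hypothetical and the code is not the helper's implementation.

```python
import ctypes


def to_u32(value: int) -> int:
    """Model storing a value in an unsigned 32-bit variable, as C would."""
    return ctypes.c_uint32(value).value


def cookie_issued(mss: int) -> bool:
    """For an unsigned value, 'not issued' can only mean exactly zero."""
    return to_u32(mss) != 0


# A negative number can never be observed through an unsigned variable:
# it wraps to a large positive value, so a '<= 0' test is really '== 0'.
wrapped = to_u32(-1)
```

Because the `<` half of `<= 0` can never be true for an unsigned value, an exact comparison against 0 expresses the intent directly.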
2019-08-31Merge branch 'bpf-nfp-map-op-cache'Daniel Borkmann
Jakub Kicinski says:

====================
This set adds a small batching and cache mechanism to the driver. Map dumps require two operations per element - get next, and lookup. Each of those needs a round trip to the device, and on a loaded system scheduling out and in of the dumping process. This set makes the driver request a number of entries at the same time, and if no operation which would modify the map happens from the host side those entries are used to serve lookup requests for up to 250us, at which point they are considered stale.

This set has been measured to provide almost 4x dumping speed improvement, Jaco says:

OLD dump times
  500 000 elements: 26.1s
1 000 000 elements: 54.5s

NEW dump times
  500 000 elements: 7.6s
1 000 000 elements: 16.5s
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31nfp: bpf: add simple map op cacheJakub Kicinski
Each get_next and lookup call requires a round trip to the device. However, the device is capable of giving us a few entries back, instead of just one. In this patch we ask for a small yet reasonable number of entries (4) on every get_next call, and on subsequent get_next/lookup calls check this little cache for a hit. The cache is only kept for 250us, and is invalidated on every operation which may modify the map (e.g. delete or update call). Note that operations may be performed simultaneously, so we have to keep track of operations in flight. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
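The caching scheme described here can be modelled with a toy sketch. It is a simplified illustration, not the nfp driver code: the class and attribute names are hypothetical, a plain dict stands in for the device-resident map, and the tracking of operations in flight is left out entirely. Only the batch size of 4, the 250us lifetime, and the invalidation on modifying operations come from the commit message.

```python
import time
from typing import Callable, Dict, Optional

CACHE_TTL_S = 250e-6  # cached entries are treated as stale after 250us
BATCH = 4             # entries requested per get_next round trip


class MapOpCache:
    """Toy model of a batching cache in front of a device-resident map."""

    def __init__(self, device: Dict[int, int],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.device = device      # stands in for the map held on the device
        self.clock = clock
        self.round_trips = 0      # counts simulated device round trips
        self._invalidate()

    def _invalidate(self) -> None:
        self.next_of: Dict[Optional[int], int] = {}
        self.values: Dict[int, int] = {}
        self.filled_at: Optional[float] = None

    def _fresh(self) -> bool:
        return (self.filled_at is not None
                and self.clock() - self.filled_at < CACHE_TTL_S)

    def get_next(self, key: Optional[int]) -> Optional[int]:
        if self._fresh() and key in self.next_of:
            return self.next_of[key]          # cache hit, no round trip
        self.round_trips += 1
        keys = sorted(k for k in self.device if key is None or k > key)[:BATCH]
        self._invalidate()
        self.next_of = dict(zip([key] + keys, keys))
        self.values = {k: self.device[k] for k in keys}
        self.filled_at = self.clock()
        return keys[0] if keys else None

    def lookup(self, key: int) -> Optional[int]:
        if self._fresh() and key in self.values:
            return self.values[key]           # cache hit, no round trip
        self.round_trips += 1
        return self.device.get(key)

    def update(self, key: int, value: int) -> None:
        self._invalidate()                    # modifying ops drop the cache
        self.round_trips += 1
        self.device[key] = value

    def delete(self, key: int) -> None:
        self._invalidate()
        self.round_trips += 1
        self.device.pop(key, None)
```

In this model, dumping an 8-element map with a loop of `get_next` followed by `lookup` costs 3 round trips instead of 17, provided the dump finishes within the cache lifetime and no update or delete intervenes.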
2019-08-31nfp: bpf: rework MTU checkingJakub Kicinski
If the control channel MTU is too low to support map operations, a warning will be printed. This is not enough; we want to make sure probe fails in such a scenario, as this would clearly be a faulty configuration. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31Merge branch 'bpf-bpftool-build-improvements'Daniel Borkmann
Quentin Monnet says:

====================
This set attempts to make it easier to build bpftool, in particular when passing a specific output directory. This is a follow-up to the conversation held last month by Lorenz, Ilya and Jakub [0].

The first patch is a minor fix to bpftool's Makefile, regarding the retrieval of kernel version (which currently prints a non-relevant make warning on some invocations).

The second patch improves the Makefile commands to support more "make" invocations, or to fix building with custom output directory. On Jakub's suggestion, a script is also added to BPF selftests in order to keep track of the supported build variants.

Building bpftool with "make tools/bpf" from the top of the repository generates files in "libbpf/" and "feature/" directories under tools/bpf/ and tools/bpf/bpftool/. The third patch ensures such directories are taken care of on "make clean", and adds them to the relevant .gitignore files.

Finally, the fourth patch is a slightly modified version of Ilya's fix regarding libbpf.a appearing twice on the linking command for bpftool.

[0] https://lore.kernel.org/bpf/CACAyw9-CWRHVH3TJ=Tke2x8YiLsH47sLCijdp=V+5M836R9aAA@mail.gmail.com/

v2:
- Return error from check script if one of the make invocations returns non-zero (even if binary is successfully produced).
- Run "make clean" from bpf/ and not only bpf/bpftool/ in that same script, when relevant.
- Add a patch to clean up generated "feature/" and "libbpf/" directories.
====================

Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31tools: bpftool: do not link twice against libbpf.a in MakefileQuentin Monnet
In bpftool's Makefile, $(LIBS) includes $(LIBBPF), therefore the library is used twice in the linking command. No need to have $(LIBBPF) (from $^) on that command, let's do with "$(OBJS) $(LIBS)" (but move $(LIBBPF) _before_ the -l flags in $(LIBS)). Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31tools: bpf: account for generated feature/ and libbpf/ directoriesQuentin Monnet
When building "tools/bpf" from the top of the Linux repository, the build system passes a value for the $(OUTPUT) Makefile variable to tools/bpf/Makefile and tools/bpf/bpftool/Makefile, which results in generating "libbpf/" (for bpftool) and "feature/" (bpf and bpftool) directories inside the tree. This commit adds such directories to the relevant .gitignore files, and edits the Makefiles to ensure they are removed on "make clean". The use of "rm" is also made consistent throughout those Makefiles (relies on the $(RM) variable, uses "--" to prevent interpreting $(OUTPUT)/$(DESTDIR) as options). v2: - New patch. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31tools: bpftool: improve and check builds for different make invocationsQuentin Monnet
There are a number of alternative "make" invocations that can be used to compile bpftool. The following invocations are expected to work:

- through the kbuild system, from the top of the repository (make tools/bpf)
- by telling make to change to the bpftool directory (make -C tools/bpf/bpftool)
- by building the BPF tools from tools/ (cd tools && make bpf)
- by running make from bpftool directory (cd tools/bpf/bpftool && make)

Additionally, setting the O or OUTPUT variables should tell the build system to use a custom output path, for each of these alternatives.

The following patch fixes the following invocations:

  $ make tools/bpf
  $ make tools/bpf O=<dir>
  $ make -C tools/bpf/bpftool OUTPUT=<dir>
  $ make -C tools/bpf/bpftool O=<dir>
  $ cd tools/ && make bpf O=<dir>
  $ cd tools/bpf/bpftool && make OUTPUT=<dir>
  $ cd tools/bpf/bpftool && make O=<dir>

After this commit, the build still fails for two variants when passing the OUTPUT variable:

  $ make tools/bpf OUTPUT=<dir>
  $ cd tools/ && make bpf OUTPUT=<dir>

In order to remember and check what make invocations are supposed to work, and to document the ones which do not, a new script is added to the BPF selftests. Note that some invocations require the kernel to be configured, so the script skips them if no .config file is found.

v2:
- In make_and_clean(), set $ERROR to 1 when "make" returns non-zero, even if the binary was produced.
- Run "make clean" from the correct directory (bpf/ instead of bpftool/, when relevant).

Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31tools: bpftool: ignore make built-in rules for getting kernel versionQuentin Monnet
Bpftool calls the toplevel Makefile to get the kernel version for the sources it is built from. But when the utility is built from the top of the kernel repository, it may dump the following error message for certain architectures (including x86):

  $ make tools/bpf
  [...]
  make[3]: *** [checkbin] Error 1
  [...]

This does not prevent bpftool compilation, but may feel disconcerting. The "checkbin" arch-dependent target is not supposed to be called for target "kernelversion", which is a simple "echo" of the version number.

It turns out this is caused by the make invocation in tools/bpf/bpftool, which attempts to find implicit rules to apply. Extract from debug output:

  Reading makefiles...
  Reading makefile 'Makefile'...
  Reading makefile 'scripts/Kbuild.include' (search path) (no ~ expansion)...
  Reading makefile 'scripts/subarch.include' (search path) (no ~ expansion)...
  Reading makefile 'arch/x86/Makefile' (search path) (no ~ expansion)...
  Reading makefile 'scripts/Makefile.kcov' (search path) (no ~ expansion)...
  Reading makefile 'scripts/Makefile.gcc-plugins' (search path) (no ~ expansion)...
  Reading makefile 'scripts/Makefile.kasan' (search path) (no ~ expansion)...
  Reading makefile 'scripts/Makefile.extrawarn' (search path) (no ~ expansion)...
  Reading makefile 'scripts/Makefile.ubsan' (search path) (no ~ expansion)...
  Updating makefiles....
  Considering target file 'scripts/Makefile.ubsan'.
  Looking for an implicit rule for 'scripts/Makefile.ubsan'.
  Trying pattern rule with stem 'Makefile.ubsan'.
  [...]
  Trying pattern rule with stem 'Makefile.ubsan'.
  Trying implicit prerequisite 'scripts/Makefile.ubsan.o'.
  Looking for a rule with intermediate file 'scripts/Makefile.ubsan.o'.
  Avoiding implicit rule recursion.
  Trying pattern rule with stem 'Makefile.ubsan'.
  Trying rule prerequisite 'prepare'.
  Trying rule prerequisite 'FORCE'.
  Found an implicit rule for 'scripts/Makefile.ubsan'.
  Considering target file 'prepare'.
  File 'prepare' does not exist.
  Considering target file 'prepare0'.
  File 'prepare0' does not exist.
  Considering target file 'archprepare'.
  File 'archprepare' does not exist.
  Considering target file 'archheaders'.
  File 'archheaders' does not exist.
  Finished prerequisites of target file 'archheaders'.
  Must remake target 'archheaders'.
  Putting child 0x55976f4f6980 (archheaders) PID 31743 on the chain.

To avoid that, pass the -r and -R flags to eliminate the use of make built-in rules (and while at it, built-in variables) when running command "make kernelversion" from bpftool's Makefile.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31bpf: s390: add JIT support for multi-function programsYauheni Kaliuta
This adds support for bpf-to-bpf function calls in the s390 JIT compiler. The JIT compiler converts the bpf call instructions to native branch instructions. After a round of the usual passes, the start addresses of the JITed images for the callee functions are known. Finally, to fixup the branch target addresses, we need to perform an extra pass. Because of the address range in which JITed images are allocated on s390, the offsets of the start addresses of these images from __bpf_call_base are as large as 64 bits. So, for a function call, the imm field of the instruction cannot be used to determine the callee's address. Use bpf_jit_get_func_addr() helper instead. The patch borrows a lot from: commit 8c11ea5ce13d ("bpf, arm64: fix getting subprog addr from aux for calls") commit e2c95a61656d ("bpf, ppc64: generalize fetching subprog into bpf_jit_get_func_addr") commit 8484ce8306f9 ("bpf: powerpc64: add JIT support for multi-function programs") (including the commit message). test_verifier (5.3-rc6 with CONFIG_BPF_JIT_ALWAYS_ON=y): without patch: Summary: 1501 PASSED, 0 SKIPPED, 47 FAILED with patch: Summary: 1540 PASSED, 0 SKIPPED, 8 FAILED Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
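The need for an extra fixup pass, and the reason a 32-bit imm field cannot carry the callee address, can be sketched with a toy model. This is not the s390 JIT: the function names, the tuple-based instruction format, and the example addresses are hypothetical and chosen only to make the idea concrete.

```python
from typing import Dict, List, Tuple

S32_MIN, S32_MAX = -(1 << 31), (1 << 31) - 1

Op = Tuple  # ("mov",), ("exit",) or ("call", target)


def fits_imm32(offset: int) -> bool:
    """True if the offset can be stored in a signed 32-bit imm field."""
    return S32_MIN <= offset <= S32_MAX


def jit_program(funcs: Dict[str, List[Op]], base_addr: int):
    # Usual passes: lay out one image per function. Call targets are still
    # symbolic because callee start addresses are not known yet.
    addrs: Dict[str, int] = {}
    addr = base_addr
    for name, body in funcs.items():
        addrs[name] = addr
        addr += len(body)  # toy model: one address unit per instruction

    # Extra pass: every start address is known, so patch the call targets.
    images = {
        name: [("call", addrs[op[1]]) if op[0] == "call" else op for op in body]
        for name, body in funcs.items()
    }
    return addrs, images


# Hypothetical two-function program and made-up addresses.
funcs = {
    "main": [("mov",), ("call", "sub"), ("exit",)],
    "sub": [("mov",), ("exit",)],
}
addrs, images = jit_program(funcs, base_addr=0x3FF00000000)
call_base = 0x100000  # stand-in for __bpf_call_base
offset = addrs["sub"] - call_base
```

Here `fits_imm32(offset)` is false, which mirrors the commit's point: when images land far from the call base, the callee address has to be obtained some other way than from the instruction's imm field.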
2019-08-30Merge branch 'Fixes-for-unlocked-cls-hardware-offload-API-refactoring'David S. Miller
Vlad Buslov says: ==================== Fixes for unlocked cls hardware offload API refactoring Two fixes for my "Refactor cls hardware offload API to support rtnl-independent drivers" series. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30net/mlx5e: Move local var definition into ifdef blockVlad Buslov
New local variable "struct flow_block_offload *f" was added to mlx5e_setup_tc() in recent rtnl lock removal patches. The variable is used in code that is only compiled when CONFIG_MLX5_ESWITCH is enabled. This results in a compilation warning about an unused variable when CONFIG_MLX5_ESWITCH is not set. Move the variable definition into the eswitch-specific code block from the beginning of the mlx5e_setup_tc() function. Fixes: c9f14470d048 ("net: sched: add API for registering unlocked offload block callbacks") Reported-by: tanhuazhong <tanhuazhong@huawei.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30net: sched: cls_matchall: cleanup flow_action before deallocatingVlad Buslov
Recent rtnl lock removal patch changed flow_action infra to require proper cleanup besides simple memory deallocation. However, matchall classifier was not updated to call tc_cleanup_flow_action(). Add proper cleanup to mall_replace_hw_filter() and mall_reoffload(). Fixes: 5a6ff4b13d59 ("net: sched: take reference to action dev before calling offloads") Reported-by: Ido Schimmel <idosch@mellanox.com> Tested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>