summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-08-31tools/power turbostat: Add support for Hygon Fam 18h (Dhyana) RAPLPu Wen
Commit 9392bd98bba760be96ee ("tools/power turbostat: Add support for AMD Fam 17h (Zen) RAPL") and the commit 3316f99a9f1b68c578c5 ("tools/power turbostat: Also read package power on AMD F17h (Zen)") add AMD Fam 17h RAPL support. Hygon Family 18h(Dhyana) support RAPL in bit 14 of CPUID 0x80000007 EDX, and has MSRs RAPL_PWR_UNIT/CORE_ENERGY_STAT/PKG_ENERGY_STAT. So add Hygon Dhyana Family 18h support for RAPL. Already tested on Hygon multi-node systems and it shows correct per-core energy usage and the total package power. Signed-off-by: Pu Wen <puwen@hygon.cn> Reviewed-by: Calvin Walton <calvin.walton@kepstin.ca> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power turbostat: Fix caller parameter of get_tdp_amd()Pu Wen
Commit 9392bd98bba760be96ee ("tools/power turbostat: Add support for AMD Fam 17h (Zen) RAPL") add a function get_tdp_amd(), the parameter is CPU family. But the rapl_probe_amd() function use wrong model parameter. Fix the wrong caller parameter of get_tdp_amd() to use family. Cc: <stable@vger.kernel.org> # v5.1+ Signed-off-by: Pu Wen <puwen@hygon.cn> Reviewed-by: Calvin Walton <calvin.walton@kepstin.ca> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power turbostat: Fix CPU%C1 display valueSrinivas Pandruvada
In some case C1% will be wrong value, when platform doesn't have MSR for C1 residency. For example: Core CPU CPU%c1 - - 100.00 0 0 100.00 0 2 100.00 1 1 100.00 1 3 100.00 But adding Busy% will fix this Core CPU Busy% CPU%c1 - - 99.77 0.23 0 0 99.77 0.23 0 2 99.77 0.23 1 1 99.77 0.23 1 3 99.77 0.23 This issue can be reproduced on most of the recent systems including Broadwell, Skylake and later. This is because if we don't select Busy% or Avg_MHz or Bzy_MHz then mperf value will not be read from MSR, so it will be 0. But this is required for C1% calculation when MSR for C1 residency is not present. Same is true for C3, C6 and C7 column selection. So add another define DO_BIC_READ(), which doesn't depend on user column selection and use for mperf, C3, C6 and C7 related counters. So when there is no platform support for C1 residency counters, we still read these counters, if the CPU has support and user selected display of CPU%c1. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power turbostat: do not enforce 1msArtem Bityutskiy
Turbostat works by taking a snapshot of counters, sleeping, taking another snapshot, calculating deltas, and printing out the table. The sleep time is controlled via -i option or by user sending a signal or a character to stdin. In the latter case, turbostat always adds 1 ms sleep before it reads the counters, in order to avoid larger imprecisions in the results in prints. While the 1 ms delay may be a good idea for a "dumb" user, it is a problem for an "aware" user. I do thousands and thousands of measurements over a short period of time (like 2ms), and turbostat unconditionally adds a 1ms to my interval, so I cannot get what I really need. This patch removes the unconditional 1ms sleep. This is an expert user tool, after all, and non-experts will unlikely ever use it in the non-fixed interval mode anyway, so I think it is OK to remove the 1ms delay. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power turbostat: read from pipes tooArtem Bityutskiy
Commit '47936f944e78 tools/power turbostat: fix printing on input' make a valid fix, but it completely disabled piped stdin support, which is a valuable use-case. Indeed, if stdin is a pipe, turbostat won't read anything from it, so it becomes impossible to get turbostat output at user-defined moments, instead of the regular intervals. There is no reason why this should works for terminals, but not for pipes. This patch improves the situation. Instead of ignoring pipes, we read data from them but gracefully handle the EOF case. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power turbostat: Add Ice Lake NNPI supportRajneesh Bhardwaj
This enables turbostat utility on Ice Lake NNPI SoC. Link: https://lkml.org/lkml/2019/6/5/1034 Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power turbostat: rename has_hsw_msrs()Len Brown
Perhaps if this more descriptive name had been used, then we wouldn't have had the HSW ULT vs HSW CORE bug, fixed by the previous commit. Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power turbostat: Fix Haswell Core systemsLen Brown
turbostat: cpu0: msr offset 0x630 read failed: Input/output error because Haswell Core does not have C8-C10. Output C8-C10 only on Haswell ULT. Fixes: f5a4c76ad7de ("tools/power turbostat: consolidate duplicate model numbers") Reported-by: Prarit Bhargava <prarit@redhat.com> Suggested-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power turbostat: add Jacobsville supportZhang Rui
Jacobsville behaves like Denverton. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power turbostat: fix buffer overrunNaoya Horiguchi
turbostat could be terminated by general protection fault on some latest hardwares which (for example) support 9 levels of C-states and show 18 "tADDED" lines. That bloats the total output and finally causes buffer overrun. So let's extend the buffer to avoid this. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power turbostat: fix file descriptor leaksGustavo A. R. Silva
Fix file descriptor leaks by closing fp before return. Addresses-Coverity-ID: 1444591 ("Resource leak") Addresses-Coverity-ID: 1444592 ("Resource leak") Fixes: 5ea7647b333f ("tools/power turbostat: Warn on bad ACPI LPIT data") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power turbostat: fix leak of file descriptor on error return pathColin Ian King
Currently the error return path does not close the file fp and leaks a file descriptor. Fix this by closing the file. Fixes: 5ea7647b333f ("tools/power turbostat: Warn on bad ACPI LPIT data") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power turbostat: Make interval calculation per thread to reduce jitterYazen Ghannam
Turbostat currently normalizes TSC and other values by dividing by an interval. This interval is the delta between the start of one global (all counters on all CPUs) sampling and the start of another. However, this introduces a lot of jitter into the data. In order to reduce jitter, the interval calculation should be based on timestamps taken per thread and close to the start of the thread's sampling. Define a per thread time value to hold the delta between samples taken on the thread. Use the timestamp taken at the beginning of sampling to calculate the delta. Move the thread's beginning timestamp to after the CPU migration to avoid jitter due to the migration. Use the global time delta for the average time delta. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power turbostat: remove duplicate pc10 columnLen Brown
Remove the duplicate pc10 column. Fixes: be0e54c4ebbf ("turbostat: Build-in "Low Power Idle" counters support") Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power x86_energy_perf_policy: Fix argument parsingZephaniah E. Loss-Cutler-Hull
The -w argument in x86_energy_perf_policy currently triggers an unconditional segfault. This is because the argument string reads: "+a:c:dD:E:e:f:m:M:rt:u:vw" and yet the argument handler expects an argument. When parse_optarg_string is called with a null argument, we then proceed to crash in strncmp, not horribly friendly. The man page describes -w as taking an argument, the long form (--hwp-window) is correctly marked as taking a required argument, and the code expects it. As such, this patch simply marks the short form (-w) as requiring an argument. Signed-off-by: Zephaniah E. Loss-Cutler-Hull <zephaniah@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power: Fix typo in man pageMatt Lupfer
From context, we mean EPB (Enegry Performance Bias). Signed-off-by: Matt Lupfer <mlupfer@ddn.com> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power/x86: Enable compiler optimisations and Fortify by defaultBen Hutchings
Compiling without optimisations is silly, especially since some warnings depend on the optimiser. Use -O2. Fortify adds warnings for unchecked I/O (among other things), which seems to be a good idea for user-space code. Enable that too. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31tools/power x86_energy_perf_policy: Fix "uninitialized variable" warnings at -O2Ben Hutchings
x86_energy_perf_policy first uses __get_cpuid() to check the maximum CPUID level and exits if it is too low. It then assumes that later calls will succeed (which I think is architecturally guaranteed). It also assumes that CPUID works at all (which is not guaranteed on x86_32). If optimisations are enabled, gcc warns about potentially uninitialized variables. Fix this by adding an exit-on-error after every call to __get_cpuid() instead of just checking the maximum level. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Len Brown <len.brown@intel.com>
2019-08-31Merge branch 'i2c/for-current' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "I2C has a bunch of driver fixes and a core improvement to make the on-going API transition more robust" * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: mediatek: disable zero-length transfers for mt8183 i2c: iproc: Stop advertising support of SMBUS quick cmd MAINTAINERS: i2c mv64xxx: Update documentation path i2c: piix4: Fix port selection for AMD Family 16h Model 30h i2c: designware: Synchronize IRQs when unregistering slave client i2c: i801: Avoid memory leak in check_acpi_smo88xx_device() i2c: make i2c_unregister_device() ERR_PTR safe
2019-08-31Merge tag 'trace-v5.3-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Small fixes and minor cleanups for tracing: - Make exported ftrace function not static - Fix NULL pointer dereference in reading probes as they are created - Fix NULL pointer dereference in k/uprobe clean up path - Various documentation fixes" * tag 'trace-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Correct kdoc formats ftrace/x86: Remove mcount() declaration tracing/probe: Fix null pointer dereference tracing: Make exported ftrace_set_clr_event non-static ftrace: Check for successful allocation of hash ftrace: Check for empty hash and comment the race with registering probes ftrace: Fix NULL pointer dereference in t_probe_next()
2019-08-31Merge tag 'riscv/for-v5.3-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fix from Paul Walmsley: "One significant fix for 32-bit RISC-V systems: Fix the RV32 memory map to prevent userspace from corrupting the FIXMAP area. Without this patch, the system can crash very early during the boot" * tag 'riscv/for-v5.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: RISC-V: Fix FIXMAP area corruption on RV32 systems
2019-08-31Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Radim Krčmář: "PPC: - Fix bug which could leave locks held in the host on return to a guest. x86: - Prevent infinitely looping emulation of a failing syscall while single stepping. - Do not crash the host when nesting is disabled" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: Don't update RIP or do single-step on faulting emulation KVM: x86: hyper-v: don't crash on KVM_GET_SUPPORTED_HV_CPUID when kvm_intel.nested is disabled KVM: PPC: Book3S: Fix incorrect guest-to-user-translation error handling
2019-08-31Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc mm fixes from Andrew Morton: "7 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: memcontrol: fix percpu vmstats and vmevents flush mm, memcg: do not set reclaim_state on soft limit reclaim mailmap: add aliases for Dmitry Safonov mm/z3fold.c: fix lock/unlock imbalance in z3fold_page_isolate mm, memcg: partially revert "mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones" mm/zsmalloc.c: fix build when CONFIG_COMPACTION=n mm: memcontrol: flush percpu slab vmstats on kmem offlining
2019-08-31tracing: Correct kdoc formatsJakub Kicinski
Fix the following kdoc warnings: kernel/trace/trace.c:1579: warning: Function parameter or member 'tr' not described in 'update_max_tr_single' kernel/trace/trace.c:1579: warning: Function parameter or member 'tsk' not described in 'update_max_tr_single' kernel/trace/trace.c:1579: warning: Function parameter or member 'cpu' not described in 'update_max_tr_single' kernel/trace/trace.c:1776: warning: Function parameter or member 'type' not described in 'register_tracer' kernel/trace/trace.c:2239: warning: Function parameter or member 'task' not described in 'tracing_record_taskinfo' kernel/trace/trace.c:2239: warning: Function parameter or member 'flags' not described in 'tracing_record_taskinfo' kernel/trace/trace.c:2269: warning: Function parameter or member 'prev' not described in 'tracing_record_taskinfo_sched_switch' kernel/trace/trace.c:2269: warning: Function parameter or member 'next' not described in 'tracing_record_taskinfo_sched_switch' kernel/trace/trace.c:2269: warning: Function parameter or member 'flags' not described in 'tracing_record_taskinfo_sched_switch' kernel/trace/trace.c:3078: warning: Function parameter or member 'ip' not described in 'trace_vbprintk' kernel/trace/trace.c:3078: warning: Function parameter or member 'fmt' not described in 'trace_vbprintk' kernel/trace/trace.c:3078: warning: Function parameter or member 'args' not described in 'trace_vbprintk' Link: http://lkml.kernel.org/r/20190828052549.2472-2-jakub.kicinski@netronome.com Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31ftrace/x86: Remove mcount() declarationJisheng Zhang
Commit 562e14f72292 ("ftrace/x86: Remove mcount support") removed the support for using mcount, so we could remove the mcount() declaration to clean up. Link: http://lkml.kernel.org/r/20190826170150.10f101ba@xhacker.debian Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31tracing/probe: Fix null pointer dereferenceXinpeng Liu
BUG: KASAN: null-ptr-deref in trace_probe_cleanup+0x8d/0xd0 Read of size 8 at addr 0000000000000000 by task syz-executor.0/9746 trace_probe_cleanup+0x8d/0xd0 free_trace_kprobe.part.14+0x15/0x50 alloc_trace_kprobe+0x23e/0x250 Link: http://lkml.kernel.org/r/1565220563-980-1-git-send-email-danielliu861@gmail.com Fixes: e3dc9f898ef9c ("tracing/probe: Add trace_event_call accesses APIs") Signed-off-by: Xinpeng Liu <danielliu861@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31tracing: Make exported ftrace_set_clr_event non-staticDenis Efremov
The function ftrace_set_clr_event is declared static and marked EXPORT_SYMBOL_GPL(), which is at best an odd combination. Because the function was decided to be a part of API, this commit removes the static attribute and adds the declaration to the header. Link: http://lkml.kernel.org/r/20190704172110.27041-1-efremov@linux.com Fixes: f45d1225adb04 ("tracing: Kernel access to Ftrace instances") Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Denis Efremov <efremov@linux.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-30udp: Remove unlikely() from IS_ERR*() conditionDenis Efremov
"unlikely(IS_ERR_OR_NULL(x))" is excessive. IS_ERR_OR_NULL() already uses unlikely() internally. Signed-off-by: Denis Efremov <efremov@linux.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Joe Perches <joe@perches.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30net/mlx5e: Remove unlikely() from WARN*() conditionDenis Efremov
"unlikely(WARN_ON_ONCE(x))" is excessive. WARN_ON_ONCE() already uses unlikely() internally. Signed-off-by: Denis Efremov <efremov@linux.com> Cc: Boris Pismenny <borisp@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Joe Perches <joe@perches.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: netdev@vger.kernel.org Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fix from Herbert Xu: "Fix a potential crash in the ccp driver" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: ccp - Ignore unconfigured CCP device on suspend/resume
2019-08-30Partially revert "kfifo: fix kfifo_alloc() and kfifo_init()"Linus Torvalds
Commit dfe2a77fd243 ("kfifo: fix kfifo_alloc() and kfifo_init()") made the kfifo code round the number of elements up. That was good for __kfifo_alloc(), but it's actually wrong for __kfifo_init(). The difference? __kfifo_alloc() will allocate the rounded-up number of elements, but __kfifo_init() uses an allocation done by the caller. We can't just say "use more elements than the caller allocated", and have to round down. The good news? All the normal cases will be using power-of-two arrays anyway, and most users of kfifo's don't use kfifo_init() at all, but one of the helper macros to declare a KFIFO that enforce the proper power-of-two behavior. But it looks like at least ibmvscsis might be affected. The bad news? Will Deacon refers to an old thread and points points out that the memory ordering in kfifo's is questionable. See https://lore.kernel.org/lkml/20181211034032.32338-1-yuleixzhang@tencent.com/ for more. Fixes: dfe2a77fd243 ("kfifo: fix kfifo_alloc() and kfifo_init()") Reported-by: laokz <laokz@foxmail.com> Cc: Stefani Seibold <stefani@seibold.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Greg KH <greg@kroah.com> Cc: Kees Cook <keescook@chromium.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30mm: memcontrol: fix percpu vmstats and vmevents flushShakeel Butt
Instead of using raw_cpu_read() use per_cpu() to read the actual data of the corresponding cpu otherwise we will be reading the data of the current cpu for the number of online CPUs. Link: http://lkml.kernel.org/r/20190829203110.129263-1-shakeelb@google.com Fixes: bb65f89b7d3d ("mm: memcontrol: flush percpu vmevents before releasing memcg") Fixes: c350a99ea2b1 ("mm: memcontrol: flush percpu vmstats before releasing memcg") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30mm, memcg: do not set reclaim_state on soft limit reclaimMichal Hocko
Adric Blake has noticed[1] the following warning: WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40 [...] Call Trace: mem_cgroup_shrink_node+0x9b/0x1d0 mem_cgroup_soft_limit_reclaim+0x10c/0x3a0 balance_pgdat+0x276/0x540 kswapd+0x200/0x3f0 ? wait_woken+0x80/0x80 kthread+0xfd/0x130 ? balance_pgdat+0x540/0x540 ? kthread_park+0x80/0x80 ret_from_fork+0x35/0x40 ---[ end trace 727343df67b2398a ]--- which tells us that soft limit reclaim is about to overwrite the reclaim_state configured up in the call chain (kswapd in this case but the direct reclaim is equally possible). This means that reclaim stats would get misleading once the soft reclaim returns and another reclaim is done. Fix the warning by dropping set_task_reclaim_state from the soft reclaim which is always called with reclaim_state set up. [1] http://lkml.kernel.org/r/CAE1jjeePxYPvw1mw2B3v803xHVR_BNnz0hQUY_JDMN8ny29M6w@mail.gmail.com Link: http://lkml.kernel.org/r/20190828071808.20410-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Adric Blake <promarbler14@gmail.com> Acked-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30mailmap: add aliases for Dmitry SafonovDmitry Safonov
I don't work for Virtuozzo or Samsung anymore and I've noticed that they have started sending annoying html email-replies. And I prioritize my personal emails over work email box, so while at it add an entry for Arista too - so I can reply faster when needed. Link: http://lkml.kernel.org/r/20190827220346.11123-1-dima@arista.com Signed-off-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30mm/z3fold.c: fix lock/unlock imbalance in z3fold_page_isolateGustavo A. R. Silva
Fix lock/unlock imbalance by unlocking *zhdr* before return. Addresses Coverity ID 1452811 ("Missing unlock") Link: http://lkml.kernel.org/r/20190826030634.GA4379@embeddedor Fixes: d776aaa9895e ("mm/z3fold.c: fix race between migration and destruction") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30mm, memcg: partially revert "mm/memcontrol.c: keep local VM counters in sync ↵Roman Gushchin
with the hierarchical ones" Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones") effectively decreased the precision of per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters. That's good for displaying in memory.stat, but brings a serious regression into the reclaim process. One issue I've discovered and debugged is the following: lruvec_lru_size() can return 0 instead of the actual number of pages in the lru list, preventing the kernel to reclaim last remaining pages. Result is yet another dying memory cgroups flooding. The opposite is also happening: scanning an empty lru list is the waste of cpu time. Also, inactive_list_is_low() can return incorrect values, preventing the active lru from being scanned and freed. It can fail both because the size of active and inactive lists are inaccurate, and because the number of workingset refaults isn't precise. In other words, the result is pretty random. I'm not sure, if using the approximate number of slab pages in count_shadow_number() is acceptable, but issues described above are enough to partially revert the patch. Let's keep per-memcg vmstat_local batched (they are only used for displaying stats to the userspace), but keep lruvec stats precise. This change fixes the dead memcg flooding on my setup. Link: http://lkml.kernel.org/r/20190817004726.2530670-1-guro@fb.com Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones") Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Yafang Shao <laoar.shao@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30mm/zsmalloc.c: fix build when CONFIG_COMPACTION=nAndrew Morton
Fixes: 701d678599d0c1 ("mm/zsmalloc.c: fix race condition in zs_destroy_pool") Link: http://lkml.kernel.org/r/201908251039.5oSbEEUT%25lkp@intel.com Reported-by: kbuild test robot <lkp@intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Jonathan Adams <jwadams@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30mm: memcontrol: flush percpu slab vmstats on kmem offliningRoman Gushchin
I've noticed that the "slab" value in memory.stat is sometimes 0, even if some children memory cgroups have a non-zero "slab" value. The following investigation showed that this is the result of the kmem_cache reparenting in combination with the per-cpu batching of slab vmstats. At the offlining some vmstat value may leave in the percpu cache, not being propagated upwards by the cgroup hierarchy. It means that stats on ancestor levels are lower than actual. Later when slab pages are released, the precise number of pages is substracted on the parent level, making the value negative. We don't show negative values, 0 is printed instead. To fix this issue, let's flush percpu slab memcg and lruvec stats on memcg offlining. This guarantees that numbers on all ancestor levels are accurate and match the actual number of outstanding slab pages. Link: http://lkml.kernel.org/r/20190819202338.363363-3-guro@fb.com Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal") Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Spurious warning when loading rules using the physdev match, from Todd Seidelmann. 2) Fix FTP conntrack helper debugging output, from Thomas Jarosch. 3) Restore per-netns nf_conntrack_{acct,helper,timeout} sysctl knobs, from Florian Westphal. 4) Clear skbuff timestamp from the flowtable datapath, also from Florian. 5) Fix incorrect byteorder of NFT_META_BRI_IIFVPROTO, from wenxu. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-08-31 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix 32-bit zero-extension during constant blinding which has been causing a regression on ppc64, from Naveen. 2) Fix a latency bug in nfp driver when updating stack index register, from Jiong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30bnxt_en: Fix compile error regression with CONFIG_BNXT_SRIOV not set.Michael Chan
Add a new function bnxt_get_registered_vfs() to handle the work of getting the number of registered VFs under #ifdef CONFIG_BNXT_SRIOV. The main code will call this function and will always work correctly whether CONFIG_BNXT_SRIOV is set or not. Fixes: 230d1f0de754 ("bnxt_en: Handle firmware reset.") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-31Merge branch 'bpf-xdp-unaligned-chunk'Daniel Borkmann
Kevin Laatz says: ==================== This patch set adds the ability to use unaligned chunks in the XDP umem. Currently, all chunk addresses passed to the umem are masked to be chunk size aligned (max is PAGE_SIZE). This limits where we can place chunks within the umem as well as limiting the packet sizes that are supported. The changes in this patch set removes these restrictions, allowing XDP to be more flexible in where it can place a chunk within a umem. By relaxing where the chunks can be placed, it allows us to use an arbitrary buffer size and place that wherever we have a free address in the umem. These changes add the ability to support arbitrary frame sizes up to 4k (PAGE_SIZE) and make it easy to integrate with other existing frameworks that have their own memory management systems, such as DPDK. In DPDK, for example, there is already support for AF_XDP with zero-copy. However, with this patch set the integration will be much more seamless. You can find the DPDK AF_XDP driver at: https://git.dpdk.org/dpdk/tree/drivers/net/af_xdp Since we are now dealing with arbitrary frame sizes, we need also need to update how we pass around addresses. Currently, the addresses can simply be masked to 2k to get back to the original address. This becomes less trivial when using frame sizes that are not a 'power of 2' size. This patch set modifies the Rx/Tx descriptor format to use the upper 16-bits of the addr field for an offset value, leaving the lower 48-bits for the address (this leaves us with 256 Terabytes, which should be enough!). We only need to use the upper 16-bits to store the offset when running in unaligned mode. Rather than adding the offset (headroom etc) to the address, we will store it in the upper 16-bits of the address field. This way, we can easily add the offset to the address where we need it, using some bit manipulation and addition, and we can also easily get the original address wherever we need it (for example in i40e_zca_free) by simply masking to get the lower 48-bits of the address field. The patch set was tested with the following set up: - Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz - Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02) - Driver: i40e - Application: xdpsock with l2fwd (single interface) - Turbo disabled in BIOS There are no changes to performance before and after these patches for SKB mode and Copy mode. Zero-copy mode saw a performance degradation of ~1.5%. This patch set has been applied against commit 0bb52b0dfc88 ("tools: bpftool: add 'bpftool map freeze' subcommand") Structure of the patch set: Patch 1: - Remove unnecessary masking and headroom addition during zero-copy Rx buffer recycling in i40e. This change is required in order for the buffer recycling to work in the unaligned chunk mode. Patch 2: - Remove unnecessary masking and headroom addition during zero-copy Rx buffer recycling in ixgbe. This change is required in order for the buffer recycling to work in the unaligned chunk mode. Patch 3: - Add infrastructure for unaligned chunks. Since we are dealing with unaligned chunks that could potentially cross a physical page boundary, we add checks to keep track of that information. We can later use this information to correctly handle buffers that are placed at an address where they cross a page boundary. This patch also modifies the existing Rx and Tx functions to use the new descriptor format. To handle addresses correctly, we need to mask appropriately based on whether we are in aligned or unaligned mode. Patch 4: - This patch updates the i40e driver to make use of the new descriptor format. Patch 5: - This patch updates the ixgbe driver to make use of the new descriptor format. Patch 6: - This patch updates the mlx5e driver to make use of the new descriptor format. These changes are required to handle the new descriptor format and for unaligned chunks support. Patch 7: - This patch allows XSK frames smaller than page size in the mlx5e driver. Relax the requirements to the XSK frame size to allow it to be smaller than a page and even not a power of two. The current implementation can work in this mode, both with Striding RQ and without it. Patch 8: - Add flags for umem configuration to libbpf. Since we increase the size of the struct by adding flags, we also need to add the ABI versioning in this patch. Patch 9: - Modify xdpsock application to add a command line option for unaligned chunks Patch 10: - Since we can now run the application in unaligned chunk mode, we need to make sure we recycle the buffers appropriately. Patch 11: - Adds hugepage support to the xdpsock application Patch 12: - Documentation update to include the unaligned chunk scenario. We need to explicitly state that the incoming addresses are only masked in the aligned chunk mode and not the unaligned chunk mode. v2: - fixed checkpatch issues - fixed Rx buffer recycling for unaligned chunks in xdpsock - removed unused defines - fixed how chunk_size is calculated in xsk_diag.c - added some performance numbers to cover letter - modified descriptor format to make it easier to retrieve original address - removed patch adding off_t off to the zero copy allocator. This is no longer needed with the new descriptor format. v3: - added patch for mlx5 driver changes needed for unaligned chunks - moved offset handling to new helper function - changed value used for the umem chunk_mask. Now using the new descriptor format to save us doing the calculations in a number of places meaning more of the code is left unchanged while adding unaligned chunk support. v4: - reworked the next_pg_contig field in the xdp_umem_page struct. We now use the low 12 bits of the addr for flags rather than adding an extra field in the struct. - modified unaligned chunks flag define - fixed page_start calculation in __xsk_rcv_memcpy(). - move offset handling to the xdp_umem_get_* functions - modified the len field in xdp_umem_reg struct. We now use 16 bits from this for the flags field. - fixed headroom addition to handle in the mlx5e driver - other minor changes based on review comments v5: - Added ABI versioning in the libbpf patch - Removed bitfields in the xdp_umem_reg struct. Adding new flags field. - Added accessors for getting addr and offset. - Added helper function for adding the offset to the addr. - Fixed conflicts with 'bpf-af-xdp-wakeup' which was merged recently. - Fixed typo in mlx driver patch. - Moved libbpf patch to later in the set (7/11, just before the sample app changes) v6: - Added support for XSK frames smaller than page in mlx5e driver (Maxim Mikityanskiy <maximmi@mellanox.com). - Fixed offset handling in xsk_generic_rcv. - Added check for base address in xskq_is_valid_addr_unaligned. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31doc/af_xdp: include unaligned chunk caseKevin Laatz
The addition of unaligned chunks mode, the documentation needs to be updated to indicate that the incoming addr to the fill ring will only be masked if the user application is run in the aligned chunk mode. This patch also adds a line to explicitly indicate that the incoming addr will not be masked if running the user application in the unaligned chunk mode. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31samples/bpf: use hugepages in xdpsock appKevin Laatz
This patch modifies xdpsock to use mmap instead of posix_memalign. With this change, we can use hugepages when running the application in unaligned chunks mode. Using hugepages makes it more likely that we have physically contiguous memory, which supports the unaligned chunk mode better. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31samples/bpf: add buffer recycling for unaligned chunks to xdpsockKevin Laatz
This patch adds buffer recycling support for unaligned buffers. Since we don't mask the addr to 2k at umem_reg in unaligned mode, we need to make sure we give back the correct (original) addr to the fill queue. We achieve this using the new descriptor format and associated masks. The new format uses the upper 16-bits for the offset and the lower 48-bits for the addr. Since we have a field for the offset, we no longer need to modify the actual address. As such, all we have to do to get back the original address is mask for the lower 48 bits (i.e. strip the offset and we get the address on it's own). Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31samples/bpf: add unaligned chunks mode support to xdpsockKevin Laatz
This patch adds support for the unaligned chunks mode. The addition of the unaligned chunks option will allow users to run the application with more relaxed chunk placement in the XDP umem. Unaligned chunks mode can be used with the '-u' or '--unaligned' command line options. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31libbpf: add flags to umem configKevin Laatz
This patch adds a 'flags' field to the umem_config and umem_reg structs. This will allow for more options to be added for configuring umems. The first use for the flags field is to add a flag for unaligned chunks mode. These flags can either be user-provided or filled with a default. Since we change the size of the xsk_umem_config struct, we need to version the ABI. This patch includes the ABI versioning for xsk_umem__create. The Makefile was also updated to handle multiple function versions in check-abi. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31net/mlx5e: Allow XSK frames smaller than a pageMaxim Mikityanskiy
Relax the requirements to the XSK frame size to allow it to be smaller than a page and even not a power of two. The current implementation can work in this mode, both with Striding RQ and without it. The code that checks `mtu + headroom <= XSK frame size` is modified accordingly. Any frame size between 2048 and PAGE_SIZE is accepted. Functions that worked with pages only now work with XSK frames, even if their size is different from PAGE_SIZE. With XSK queues, regardless of the frame size, Striding RQ uses the stride size of PAGE_SIZE, and UMR MTTs are posted using starting addresses of frames, but PAGE_SIZE as page size. MTU guarantees that no packet data will overlap with other frames. UMR MTT size is made equal to the stride size of the RQ, because UMEM frames may come in random order, and we need to handle them one by one. PAGE_SIZE is just a power of two that is bigger than any allowed XSK frame size, and also it doesn't require making additional changes to the code. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31mlx5e: modify driver for handling offsetsKevin Laatz
With the addition of the unaligned chunks option, we need to make sure we handle the offsets accordingly based on the mode we are currently running in. This patch modifies the driver to appropriately mask the address for each case. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31ixgbe: modify driver for handling offsetsKevin Laatz
With the addition of the unaligned chunks option, we need to make sure we handle the offsets accordingly based on the mode we are currently running in. This patch modifies the driver to appropriately mask the address for each case. Signed-off-by: Kevin Laatz <kevin.laatz@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>