summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-10-15perf trace: Add syscall failure stats to -s/--summary and -S/--with-summaryArnaldo Carvalho de Melo
Just like strace has: # trace -s sleep 1 Summary of events: sleep (32370), 80 events, 93.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ nanosleep 1 0 1000.402 1000.402 1000.402 1000.402 0.00% mmap 8 0 0.023 0.002 0.003 0.004 8.49% close 5 0 0.015 0.001 0.003 0.009 51.39% mprotect 4 0 0.014 0.002 0.003 0.005 16.95% openat 3 0 0.013 0.003 0.004 0.005 14.29% munmap 1 0 0.010 0.010 0.010 0.010 0.00% read 4 0 0.005 0.001 0.001 0.002 16.83% brk 4 0 0.004 0.001 0.001 0.002 20.82% access 1 1 0.004 0.004 0.004 0.004 0.00% fstat 3 0 0.003 0.001 0.001 0.001 12.17% lseek 3 0 0.003 0.001 0.001 0.001 11.45% arch_prctl 2 1 0.002 0.001 0.001 0.001 2.30% execve 1 0 0.000 0.000 0.000 0.000 0.00% # # perf trace -S sleep 1 ? ... [continued]: execve()) = 0 0.028 brk(brk: NULL) = 0x559f5bd96000 0.033 arch_prctl(option: 0x3001, arg2: 0x7ffda8b715a0) = -1 EINVAL (Invalid argument) 0.046 access(filename: "/etc/ld.so.preload", mode: R) = -1 ENOENT (No such file or directory) 0.055 openat(dfd: CWD, filename: "/etc/ld.so.cache", flags: RDONLY|CLOEXEC) = 3 0.060 fstat(fd: 3, statbuf: 0x7ffda8b707a0) = 0 0.062 mmap(addr: NULL, len: 134346, prot: READ, flags: PRIVATE, fd: 3, off: 0) = 0x7f3aedfc4000 0.066 close(fd: 3) = 0 0.079 openat(dfd: CWD, filename: "/lib64/libc.so.6", flags: RDONLY|CLOEXEC) = 3 0.085 read(fd: 3, buf: 0x7ffda8b70948, count: 832) = 832 0.088 lseek(fd: 3, offset: 792, whence: SET) = 792 0.090 read(fd: 3, buf: 0x7ffda8b70810, count: 68) = 68 0.093 fstat(fd: 3, statbuf: 0x7ffda8b707f0) = 0 0.095 mmap(addr: NULL, len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS) = 0x7f3aedfc2000 0.101 lseek(fd: 3, offset: 792, whence: SET) = 792 0.103 read(fd: 3, buf: 0x7ffda8b70450, count: 68) = 68 0.105 lseek(fd: 3, offset: 864, whence: SET) = 864 0.107 read(fd: 3, buf: 0x7ffda8b70470, count: 32) = 32 0.110 mmap(addr: NULL, len: 1857472, prot: READ, flags: PRIVATE|DENYWRITE, fd: 3, off: 0) = 0x7f3aeddfc000 0.114 mprotect(start: 0x7f3aede1e000, len: 1679360, prot: NONE) = 0 0.121 mmap(addr: 0x7f3aede1e000, len: 1363968, prot: READ|EXEC, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x22000) = 0x7f3aede1e000 0.127 mmap(addr: 0x7f3aedf6b000, len: 311296, prot: READ, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x16f000) = 0x7f3aedf6b000 0.131 mmap(addr: 0x7f3aedfb8000, len: 24576, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x1bb000) = 0x7f3aedfb8000 0.138 mmap(addr: 0x7f3aedfbe000, len: 14272, prot: READ|WRITE, flags: PRIVATE|FIXED|ANONYMOUS) = 0x7f3aedfbe000 0.147 close(fd: 3) = 0 0.158 arch_prctl(option: SET_FS, arg2: 0x7f3aedfc3580) = 0 0.210 mprotect(start: 0x7f3aedfb8000, len: 16384, prot: READ) = 0 0.230 mprotect(start: 0x559f5b27d000, len: 4096, prot: READ) = 0 0.236 mprotect(start: 0x7f3aee00f000, len: 4096, prot: READ) = 0 0.240 munmap(addr: 0x7f3aedfc4000, len: 134346) = 0 0.300 brk(brk: NULL) = 0x559f5bd96000 0.302 brk(brk: 0x559f5bdb7000) = 0x559f5bdb7000 0.305 brk(brk: NULL) = 0x559f5bdb7000 0.310 openat(dfd: CWD, filename: "/usr/lib/locale/locale-archive", flags: RDONLY|CLOEXEC) = 3 0.315 fstat(fd: 3, statbuf: 0x7f3aedfbdac0) = 0 0.318 mmap(addr: NULL, len: 217750512, prot: READ, flags: PRIVATE, fd: 3, off: 0) = 0x7f3ae0e52000 0.325 close(fd: 3) = 0 0.358 nanosleep(rqtp: 0x7ffda8b714b0, rmtp: NULL) = 0 1000.622 close(fd: 1) = 0 1000.641 close(fd: 2) = 0 1000.664 exit_group(error_code: 0) = ? Summary of events: sleep (722), 80 events, 93.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ nanosleep 1 0 1000.194 1000.194 1000.194 1000.194 0.00% mmap 8 0 0.025 0.002 0.003 0.005 10.17% close 5 0 0.018 0.001 0.004 0.010 50.18% mprotect 4 0 0.016 0.003 0.004 0.006 16.81% openat 3 0 0.011 0.003 0.004 0.004 6.57% munmap 1 0 0.010 0.010 0.010 0.010 0.00% brk 4 0 0.005 0.001 0.001 0.002 20.72% read 4 0 0.005 0.001 0.001 0.002 16.71% access 1 1 0.005 0.005 0.005 0.005 0.00% fstat 3 0 0.004 0.001 0.001 0.002 14.82% lseek 3 0 0.003 0.001 0.001 0.001 11.66% arch_prctl 2 1 0.002 0.001 0.001 0.001 3.59% execve 1 0 0.000 0.000 0.000 0.000 0.00% # Works for system wide, e.g. for 1ms: # perf trace -s -a sleep 0.001 Summary of events: sleep (768), 94 events, 37.9% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ nanosleep 1 0 1.133 1.133 1.133 1.133 0.00% execve 7 6 0.351 0.003 0.050 0.316 88.53% mmap 8 0 0.024 0.002 0.003 0.004 8.86% mprotect 4 0 0.017 0.003 0.004 0.006 16.02% openat 3 0 0.013 0.004 0.004 0.005 8.34% munmap 1 0 0.010 0.010 0.010 0.010 0.00% brk 4 0 0.007 0.001 0.002 0.002 10.99% close 5 0 0.005 0.001 0.001 0.002 11.69% read 5 0 0.005 0.000 0.001 0.002 30.53% access 1 1 0.004 0.004 0.004 0.004 0.00% fstat 3 0 0.004 0.001 0.001 0.002 10.74% lseek 3 0 0.003 0.001 0.001 0.001 10.20% arch_prctl 2 1 0.002 0.001 0.001 0.001 3.34% Web Content (21258), 46 events, 18.5% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ recvmsg 12 12 0.015 0.001 0.001 0.002 8.50% futex 2 0 0.008 0.003 0.004 0.005 27.08% poll 6 0 0.006 0.000 0.001 0.002 22.14% read 2 0 0.006 0.002 0.003 0.003 26.08% write 1 0 0.002 0.002 0.002 0.002 0.00% Web Content (4365), 36 events, 14.5% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ recvmsg 10 10 0.015 0.001 0.002 0.003 11.83% poll 5 0 0.006 0.000 0.001 0.002 28.44% futex 2 0 0.005 0.001 0.003 0.004 48.29% read 1 0 0.003 0.003 0.003 0.003 0.00% Timer (21275), 14 events, 5.6% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ futex 6 1 0.240 0.000 0.040 0.149 64.58% write 1 0 0.008 0.008 0.008 0.008 0.00% Timer (4383), 14 events, 5.6% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ futex 6 2 0.186 0.000 0.031 0.181 96.45% write 1 0 0.010 0.010 0.010 0.010 0.00% Web Content (20354), 28 events, 11.3% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ recvmsg 8 8 0.010 0.001 0.001 0.002 15.24% poll 4 0 0.004 0.000 0.001 0.002 35.68% futex 1 0 0.003 0.003 0.003 0.003 0.00% read 1 0 0.003 0.003 0.003 0.003 0.00% Timer (20371), 10 events, 4.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ futex 4 1 0.077 0.000 0.019 0.075 95.46% write 1 0 0.005 0.005 0.005 0.005 0.00% [root@quaco ~]# Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Luis Cláudio Gonçalves <lclaudio@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-k7kh2muo5oeg56yx446hnw9v@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-10-15perf stat: Support --all-kernel/--all-userJin Yao
'perf record' has supported --all-kernel / --all-user to configure all used events to run in kernel space or run in user space. But 'perf stat' doesn't support these options. It would be useful to support these options in 'perf stat' too to keep the same semantics available in both tools. Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20191011050545.3899-1-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-10-15perf jvmti: Link against tools/lib/ctype.h to have weak strlcpy()Thomas Richter
The build of file libperf-jvmti.so succeeds but the resulting object fails to load: # ~/linux/tools/perf/perf record -k mono -- java \ -XX:+PreserveFramePointer \ -agentpath:/root/linux/tools/perf/libperf-jvmti.so \ hog 100000 123450 Error occurred during initialization of VM Could not find agent library /root/linux/tools/perf/libperf-jvmti.so in absolute path, with error: /root/linux/tools/perf/libperf-jvmti.so: undefined symbol: _ctype Add the missing _ctype symbol into the build script. Fixes: 79743bc927f6 ("perf jvmti: Link against tools/lib/string.o to have weak strlcpy()") Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Link: http://lore.kernel.org/lkml/20191008093841.59387-1-tmricht@linux.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-10-15perf annotate: Fix objdump --no-show-raw-insn flagIan Rogers
Remove redirection of objdump's stderr to /dev/null to help diagnose failures. Fix the '--no-show-raw' flag to be '--no-show-raw-insn' which binutils is permissive and allows, but fails with LLVM objdump. Signed-off-by: Ian Rogers <irogers@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Stephane Eranian <eranian@google.com> Cc: clang-built-linux@googlegroups.com Link: http://lore.kernel.org/lkml/20191010183649.23768-6-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-10-15perf annotate: Don't pipe objdump output through 'expand' commandIan Rogers
Avoiding a pipe allows objdump command failures to surface. Move to the caller of symbol__parse_objdump_line the call to strim that removes leading and trailing tabs. Add a new expand_tabs function that if a tab is present allocate a new line in which tabs are expanded. In symbol__parse_objdump_line the line had no leading spaces, so simplify the line_ip processing. Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Stephane Eranian <eranian@google.com> Cc: clang-built-linux@googlegroups.com Link: http://lore.kernel.org/lkml/20191010183649.23768-5-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-10-15perf annotate: Don't pipe objdump output through 'grep' commandIan Rogers
Simplify the objdump command by not piping the output of objdump through grep. Instead, drop lines that match the grep pattern during the reading loop. Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Stephane Eranian <eranian@google.com> Cc: clang-built-linux@googlegroups.com Link: http://lore.kernel.org/lkml/20191010183649.23768-4-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-10-15perf annotate: Use libsubcmd's run-command.h to fork objdumpIan Rogers
Reduce duplicated logic by using the subcmd library. Ensure when errors occur they are reported to the caller. Before this patch, if no lines are read the error status is 0. Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Stephane Eranian <eranian@google.com> Cc: clang-built-linux@googlegroups.com Link: http://lore.kernel.org/lkml/20191010183649.23768-3-irogers@google.com Link: http://lore.kernel.org/lkml/20191015003418.62563-1-irogers@google.com [ merged follow up fix for NULL termination as in the 2nd link above ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-10-15perf annotate: Avoid reallocation in objdump parsingIan Rogers
Objdump output is parsed using getline which allocates memory for the read. Getline will realloc if the memory is too small, but currently the line is always freed after the call. Simplify parse_objdump_line by performing the reading in symbol__disassemble. Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Stephane Eranian <eranian@google.com> Cc: clang-built-linux@googlegroups.com Link: http://lore.kernel.org/lkml/20191010183649.23768-2-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-10-15perf report: Add warning when libunwind not compiled inJin Yao
We received a user report that call-graph DWARF mode was enabled in 'perf record' but 'perf report' didn't unwind the callstack correctly. The reason was, libunwind was not compiled in. We can use 'perf -vv' to check the compiled libraries but it would be valuable to report a warning to user directly (especially valuable for a perf newbie). The warning is: Warning: Please install libunwind development packages during the perf build. Both TUI and stdio are supported. Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20191011022122.26369-1-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-10-15perf test: Avoid infinite loop for task exit caseLeo Yan
When executing the task exit testing case, perf gets stuck in an endless loop this case and doesn't return back on Arm64 Juno board. After digging into this issue, since Juno board has Arm's big.LITTLE CPUs, thus the PMUs are not compatible between the big CPUs and little CPUs. This leads to a PMU event that cannot be enabled properly when the traced task is migrated from one variant's CPU to another variant. Finally, the test case runs into infinite loop for cannot read out any event data after return from polling. Eventually, we need to work out formal solution to allow PMU events can be freely migrated from one CPU variant to another, but this is a difficult task and a different topic. This patch tries to fix the Perf test case to avoid infinite loop, when the testing detects 1000 times retrying for reading empty events, it will directly bail out and return failure. This allows the Perf tool can continue its other test cases. Signed-off-by: Leo Yan <leo.yan@linaro.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/lkml/20191011091942.29841-2-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-10-15perf test: Report failure for mmap eventsLeo Yan
When fail to mmap events in task exit case, it misses to set 'err' to -1; thus the testing will not report failure for it. This patch sets 'err' to -1 when fails to mmap events, thus Perf tool can report correct result. Fixes: d723a55096b8 ("perf test: Add test case for checking number of EXIT events") Signed-off-by: Leo Yan <leo.yan@linaro.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/lkml/20191011091942.29841-1-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-10-15perf evlist: Fix fix for freed id arraysAndi Kleen
In the earlier fix for the memory overrun of id arrays I managed to typo the wrong event in the fix. Of course we need to close the current event in the loop, not the original failing event. The same test case as in the original patch still passes. Fixes: 7834fa948beb ("perf evlist: Fix access of freed id arrays") Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: http://lore.kernel.org/lkml/20191011182140.8353-2-andi@firstfloor.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-10-15perf script: Fix --reltime with --timeAndi Kleen
My earlier patch to just enable --reltime with --time was a little too optimistic. The --time parsing would accept absolute time, which is very confusing to the user. Support relative time in --time parsing too. This only works with recent perf record that records the first sample time. Otherwise we error out. Fixes: 3714437d3fcc ("perf script: Allow --time with --reltime") Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Link: http://lore.kernel.org/lkml/20191011182140.8353-1-andi@firstfloor.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-10-15perf tools: Allow to build with -ltcmallocJiri Olsa
By using "make TCMALLOC=1" you can enable perf to be build for usage with libtcmalloc.so (gperftools). Get heap profile (tools/perf directory): $ <install gperftools> $ make TCMALLOC=1 DEBUG=1 $ HEAPPROFILE=/tmp/heapprof ./perf ... $ pprof ./perf /tmp/heapprof.000* (pprof) top Total: 2335.5 MB 1735.1 74.3% 74.3% 1735.1 74.3% memdup 402.0 17.2% 91.5% 402.0 17.2% zalloc 140.2 6.0% 97.5% 145.8 6.2% map__new 33.6 1.4% 98.9% 33.6 1.4% symbol__new 12.4 0.5% 99.5% 12.4 0.5% alloc_event 6.2 0.3% 99.7% 6.2 0.3% nsinfo__new 5.5 0.2% 100.0% 5.5 0.2% nsinfo__copy 0.3 0.0% 100.0% 0.3 0.0% dso__new 0.1 0.0% 100.0% 0.1 0.0% do_read_string 0.0 0.0% 100.0% 0.0 0.0% __GI__IO_file_doallocate See callstack: $ pprof --pdf ./perf /tmp/heapprof.00* > callstack.pdf $ pprof --web ./perf /tmp/heapprof.00* Committer testing: Install gperftools, on fedora: # dnf install gperftools-devel Then build: $ make TCMALLOC=1 DEBUG=1 -C tools/perf O=/tmp/build/perf install-bin Verify that it linked against the right library: $ ldd ~/bin/perf | grep tcma libtcmalloc.so.4 => /lib64/libtcmalloc.so.4 (0x00007fb2953a7000) $ Run 'perf trace' system wide for 1 minute: # HEAPPROFILE=/tmp/heapprof perf trace -a sleep 1m <SNIP> 59985.524 ( 0.006 ms): Web Content/20354 recvmsg(fd: 9<socket:[1762817]>, msg: 0x7ffee5fdafb0) = -1 EAGAIN (Resource temporarily unavailable) 59985.536 ( 0.005 ms): Web Content/20354 recvmsg(fd: 9<socket:[1762817]>, msg: 0x7ffee5fdafc0) = -1 EAGAIN (Resource temporarily unavailable) 59981.956 (10.143 ms): SCTP timer/21716 ... [continued]: select()) = 0 (Timeout) 59985.549 ( ): Web Content/20354 poll(ufds: 0x7f1df38af180, nfds: 3, timeout_msecs: 4294967295) ... 0.926 (59999.481 ms): sleep/29764 ... [continued]: nanosleep()) = 0 59992.133 ( ): SCTP timer/21716 select(tvp: 0x7ff5bf7fee80) ... 60000.477 ( 0.009 ms): sleep/29764 close(fd: 1) = 0 60000.493 ( 0.005 ms): sleep/29764 close(fd: 2) = 0 60000.514 ( ): sleep/29764 exit_group() = ? Dumping heap profile to /tmp/heapprof.0001.heap (Exiting, 3 MB in use) [root@quaco ~]# Install pprof: # dnf install pprof And run it: # pprof ~/bin/perf /tmp/heapprof.0001.heap Using local file /root/bin/perf. Using local file /tmp/heapprof.0001.heap. Welcome to pprof! For help, type 'help'. (pprof) top Total: 4.0 MB 1.7 42.0% 42.0% 2.2 54.1% map__new 0.9 23.3% 65.3% 0.9 23.3% zalloc 0.5 11.4% 76.7% 0.5 11.4% dso__new 0.2 5.6% 82.3% 0.3 8.5% trace__sys_enter 0.2 4.9% 87.2% 0.2 4.9% __GI___strdup 0.2 3.8% 91.0% 0.2 3.8% new_term 0.1 2.2% 93.2% 0.4 10.1% __perf_pmu__new_alias 0.0 1.0% 94.3% 0.0 1.2% event_read_fields 0.0 0.8% 95.1% 0.0 0.8% nsinfo__new 0.0 0.7% 95.8% 0.1 3.2% trace__read_syscall_info (pprof) Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Michael Petlan <mpetlan@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20191013151427.11941-2-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-10-15iommu/ipmmu-vmsa: Only call platform_get_irq() when interrupt is mandatoryGeert Uytterhoeven
As platform_get_irq() now prints an error when the interrupt does not exist, calling it gratuitously causes scary messages like: ipmmu-vmsa e6740000.mmu: IRQ index 0 not found Fix this by moving the call to platform_get_irq() down, where the existence of the interrupt is mandatory. Fixes: 7723f4c5ecdb8d83 ("driver core: platform: Add an error message to platform_get_irq*()") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Tested-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2019-10-15soundwire: depend on ACPI || OFMichal Suchanek
Now devicetree is supported for probing soundwire as well. On platforms built with !ACPI !OF (ie s390x) the device still cannot be probed and gives a build warning. Cc: stable@vger.kernel.org Fixes: a2e484585ad3 ("soundwire: core: add device tree support for slave devices") Signed-off-by: Michal Suchanek <msuchanek@suse.de> Link: https://lore.kernel.org/r/0b89b4ea16a93f523105c81a2f718b0cd7ec66f2.1570097621.git.msuchanek@suse.de Signed-off-by: Vinod Koul <vkoul@kernel.org>
2019-10-15soundwire: depend on ACPIMichal Suchanek
The device cannot be probed on !ACPI and gives this warning: drivers/soundwire/slave.c:16:12: warning: ‘sdw_slave_add’ defined but not used [-Wunused-function] static int sdw_slave_add(struct sdw_bus *bus, ^~~~~~~~~~~~~ Cc: stable@vger.kernel.org Fixes: 7c3cd189b86d ("soundwire: Add Master registration") Signed-off-by: Michal Suchanek <msuchanek@suse.de> Link: https://lore.kernel.org/r/bd685232ea511251eeb9554172f1524eabf9a46e.1570097621.git.msuchanek@suse.de Signed-off-by: Vinod Koul <vkoul@kernel.org>
2019-10-15iommu/rockchip: Don't use platform_get_irq to implicitly count irqsHeiko Stuebner
Till now the Rockchip iommu driver walked through the irq list via platform_get_irq() until it encountered an ENXIO error. With the recent change to add a central error message, this always results in such an error for each iommu on probe and shutdown. To not confuse people, switch to platform_count_irqs() to get the actual number of interrupts before walking through them. Fixes: 7723f4c5ecdb ("driver core: platform: Add an error message to platform_get_irq*()") Signed-off-by: Heiko Stuebner <heiko@sntech.de> Tested-by: Enric Balletbo i Serra <enric.balletbo@collabora.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2019-10-15dmaengine: sprd: Fix the possible memory leak issueBaolin Wang
If we terminate the channel to free all descriptors associated with this channel, we will leak the memory of current descriptor if the current descriptor is not completed, since it had been deteled from the desc_issued list and have not been added into the desc_completed list. Thus we should check if current descriptor is completed or not, when freeing the descriptors associated with one channel, if not, we should free it to avoid this issue. Fixes: 9b3b8171f7f4 ("dmaengine: sprd: Add Spreadtrum DMA driver") Reported-by: Zhenfang Wang <zhenfang.wang@unisoc.com> Tested-by: Zhenfang Wang <zhenfang.wang@unisoc.com> Signed-off-by: Baolin Wang <baolin.wang@linaro.org> Link: https://lore.kernel.org/r/170dbbc6d5366b6fa974ce2d366652e23a334251.1570609788.git.baolin.wang@linaro.org Signed-off-by: Vinod Koul <vkoul@kernel.org>
2019-10-15dmaengine: xilinx_dma: Fix control reg update in vdma_channel_set_configRadhey Shyam Pandey
In vdma_channel_set_config clear the delay, frame count and master mask before updating their new values. It avoids programming incorrect state when input parameters are different from default. Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com> Acked-by: Appana Durga Kedareswara rao <appana.durga.rao@xilinx.com> Signed-off-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/1569495060-18117-3-git-send-email-radhey.shyam.pandey@xilinx.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2019-10-15dmaengine: xilinx_dma: Fix 64-bit simple AXIDMA transferRadhey Shyam Pandey
In AXI DMA simple mode also pass MSB bits of source and destination address to xilinx_write function. It fixes simple AXI DMA operation mode using 64-bit addressing. Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com> Link: https://lore.kernel.org/r/1569495060-18117-2-git-send-email-radhey.shyam.pandey@xilinx.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2019-10-15x86/apic/x2apic: Fix a NULL pointer deref when handling a dying cpuSean Christopherson
Check that the per-cpu cluster mask pointer has been set prior to clearing a dying cpu's bit. The per-cpu pointer is not set until the target cpu reaches smp_callin() during CPUHP_BRINGUP_CPU, whereas the teardown function, x2apic_dead_cpu(), is associated with the earlier CPUHP_X2APIC_PREPARE. If an error occurs before the cpu is awakened, e.g. if do_boot_cpu() itself fails, x2apic_dead_cpu() will dereference the NULL pointer and cause a panic. smpboot: do_boot_cpu failed(-22) to wakeup CPU#1 BUG: kernel NULL pointer dereference, address: 0000000000000008 RIP: 0010:x2apic_dead_cpu+0x1a/0x30 Call Trace: cpuhp_invoke_callback+0x9a/0x580 _cpu_up+0x10d/0x140 do_cpu_up+0x69/0xb0 smp_init+0x63/0xa9 kernel_init_freeable+0xd7/0x229 ? rest_init+0xa0/0xa0 kernel_init+0xa/0x100 ret_from_fork+0x35/0x40 Fixes: 023a611748fd5 ("x86/apic/x2apic: Simplify cluster management") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20191001205019.5789-1-sean.j.christopherson@intel.com
2019-10-15x86/hyperv: Make vapic support x2apic modeRoman Kagan
Now that there's Hyper-V IOMMU driver, Linux can switch to x2apic mode when supported by the vcpus. However, the apic access functions for Hyper-V enlightened apic assume xapic mode only. As a result, Linux fails to bring up secondary cpus when run as a guest in QEMU/KVM with both hv_apic and x2apic enabled. According to Michael Kelley, when in x2apic mode, the Hyper-V synthetic apic MSRs behave exactly the same as the corresponding architectural x2apic MSRs, so there's no need to override the apic accessors. The only exception is hv_apic_eoi_write, which benefits from lazy EOI when available; however, its implementation works for both xapic and x2apic modes. Fixes: 29217a474683 ("iommu/hyper-v: Add Hyper-V stub IOMMU driver") Fixes: 6b48cb5f8347 ("X86/Hyper-V: Enlighten APIC access") Suggested-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20191010123258.16919-1-rkagan@virtuozzo.com
2019-10-15Merge tag 'perf-core-for-mingo-5.5-20191011' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo: perf trace: Arnaldo Carvalho de Melo: - Reuse the strace-like syscall_arg_fmt->scnprintf() beautification routines (convert integer arguments into strings, like open flags, etc) in tracepoint arguments. For now the type based scnprintf routines (pid_t, umode_t, etc) and the ones based in well known arg name based ("fd", etc) gets associated with tracepoint args of that type. A tracepoint only arg, "msr", for the msr:{write,read}_msr gets added as an initial step. - Introduce syscall_arg_fmt->strtoul() methods to be the reverse operation of ->scnprintf(), i.e. to go from a string to an integer. - Implement --filter, just like in 'perf record', that affects the tracepoint events specied thus far in the command line, use the ->strtoul() methods to allow strings in tables associated with beautifiers to the integers the in-kernel tracepoint (eBPF later) filters expect, e.g.: # perf trace --max-events 1 -e sched:*ipi --filter="cpu==1 || cpu==2" 0.000 as/24630 sched:sched_wake_idle_without_ipi(cpu: 1) # # perf trace --max-events 1 --max-stack=32 -e msr:* --filter="msr==IA32_TSC_DEADLINE" 207.000 cc1/19963 msr:write_msr(msr: IA32_TSC_DEADLINE, val: 5442316760822) do_trace_write_msr ([kernel.kallsyms]) do_trace_write_msr ([kernel.kallsyms]) lapic_next_deadline ([kernel.kallsyms]) clockevents_program_event ([kernel.kallsyms]) hrtimer_interrupt ([kernel.kallsyms]) smp_apic_timer_interrupt ([kernel.kallsyms]) apic_timer_interrupt ([kernel.kallsyms]) [0x6ff66c] (/usr/lib/gcc-cross/alpha-linux-gnu/8/cc1) [0x7047c3] (/usr/lib/gcc-cross/alpha-linux-gnu/8/cc1) [0x707708] (/usr/lib/gcc-cross/alpha-linux-gnu/8/cc1) execute_one_pass (/usr/lib/gcc-cross/alpha-linux-gnu/8/cc1) [0x4f3d37] (/usr/lib/gcc-cross/alpha-linux-gnu/8/cc1) [0x4f3d49] (/usr/lib/gcc-cross/alpha-linux-gnu/8/cc1) execute_pass_list (/usr/lib/gcc-cross/alpha-linux-gnu/8/cc1) cgraph_node::expand (/usr/lib/gcc-cross/alpha-linux-gnu/8/cc1) [0x2625b4] (/usr/lib/gcc-cross/alpha-linux-gnu/8/cc1) symbol_table::finalize_compilation_unit (/usr/lib/gcc-cross/alpha-linux-gnu/8/cc1) [0x5ae8b9] (/usr/lib/gcc-cross/alpha-linux-gnu/8/cc1) toplev::main (/usr/lib/gcc-cross/alpha-linux-gnu/8/cc1) main (/usr/lib/gcc-cross/alpha-linux-gnu/8/cc1) [0x26b6a] (/usr/lib/x86_64-linux-gnu/libc-2.29.so) # # perf trace --max-events 8 -e msr:* --filter="msr==IA32_SPEC_CTRL" 0.000 :13281/13281 msr:write_msr(msr: IA32_SPEC_CTRL, val: 6) 0.063 migration/3/25 msr:write_msr(msr: IA32_SPEC_CTRL) 0.217 kworker/u16:1-/4826 msr:write_msr(msr: IA32_SPEC_CTRL) 0.687 rcu_sched/11 msr:write_msr(msr: IA32_SPEC_CTRL) 0.696 :13280/13280 msr:write_msr(msr: IA32_SPEC_CTRL, val: 6) 0.305 :13281/13281 msr:write_msr(msr: IA32_SPEC_CTRL, val: 6) 0.355 :13274/13274 msr:write_msr(msr: IA32_SPEC_CTRL, val: 6) 2.743 kworker/u16:0-/6711 msr:write_msr(msr: IA32_SPEC_CTRL) # # perf trace --max-events 8 --cpu 1 -e msr:* --filter="msr!=IA32_SPEC_CTRL && msr!=IA32_TSC_DEADLINE && msr != FS_BASE" 0.000 mtr-packet/30819 msr:write_msr(msr: 0x830, val: 68719479037) 0.096 :0/0 msr:read_msr(msr: IA32_TSC_ADJUST) 238.925 mtr-packet/30819 msr:write_msr(msr: 0x830, val: 8589936893) 511.010 :0/0 msr:write_msr(msr: 0x830, val: 68719479037) 1005.052 :0/0 msr:read_msr(msr: IA32_TSC_ADJUST) 1235.131 CPU 0/KVM/3750 msr:write_msr(msr: 0x830, val: 4294969595) 1235.195 CPU 0/KVM/3750 msr:read_msr(msr: IA32_SYSENTER_ESP, val: -2199023037952) 1235.201 CPU 0/KVM/3750 msr:read_msr(msr: IA32_APICBASE, val: 4276096000) # - Default to not using libtraceevent and its plugins for beautifying tracepoint arguments, since now we're reusing the strace-like beatufiers. Use --libtraceevent_print (using just --libtrace is unambiguous and can be used as a short hand) to go back to those beautifiers. This will help in the transition, as can be seen in some of the sched tracepoints that still need some work in the libbeauty based mode: # trace --no-inherit -e msr:*,*sleep,sched:* sleep 1 0.000 ( ): sched:sched_waking(comm: "trace", pid: 3319 (trace), prio: 120, success: 1) 0.006 ( ): sched:sched_wakeup(comm: "trace", pid: 3319 (trace), prio: 120, success: 1) 0.348 ( ): sched:sched_process_exec(filename: 140212596720100, pid: 3319 (sleep), old_pid: 3319 (sleep)) 0.490 ( ): msr:write_msr(msr: FS_BASE, val: 139631189321088) 0.670 ( ): nanosleep(rqtp: 0x7ffc52c23bc0) ... 0.674 ( ): sched:sched_stat_runtime(comm: "sleep", pid: 3319 (sleep), runtime: 659259, vruntime: 78942418342) 0.675 ( ): sched:sched_switch(prev_comm: "sleep", prev_pid: 3319 (sleep), prev_prio: 120, prev_state: 1, next_comm: "swapper/0", next_prio: 120) 1001.059 ( ): sched:sched_waking(comm: "sleep", pid: 3319 (sleep), prio: 120, success: 1) 1001.098 ( ): sched:sched_wakeup(comm: "sleep", pid: 3319 (sleep), prio: 120, success: 1) 0.670 (1000.504 ms): ... [continued]: nanosleep()) = 0 1001.456 ( ): sched:sched_process_exit(comm: "sleep", pid: 3319 (sleep), prio: 120) # trace --libtrace --no-inherit -e msr:*,*sleep,sched:* sleep 1 # trace --libtrace --no-inherit -e msr:*,*sleep,sched:* sleep 1 0.000 ( ): sched:sched_waking(comm=trace pid=3323 prio=120 target_cpu=000) 0.007 ( ): sched:sched_wakeup(comm=trace pid=3323 prio=120 target_cpu=000) 0.382 ( ): sched:sched_process_exec(filename=/usr/bin/sleep pid=3323 old_pid=3323) 0.525 ( ): msr:write_msr(c0000100, value 7f5d508a0580) 0.713 ( ): nanosleep(rqtp: 0x7fff487fb4a0) ... 0.717 ( ): sched:sched_stat_runtime(comm=sleep pid=3323 runtime=617722 [ns] vruntime=78957731636 [ns]) 0.719 ( ): sched:sched_switch(prev_comm=sleep prev_pid=3323 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120) 1001.117 ( ): sched:sched_waking(comm=sleep pid=3323 prio=120 target_cpu=000) 1001.157 ( ): sched:sched_wakeup(comm=sleep pid=3323 prio=120 target_cpu=000) 0.713 (1000.522 ms): ... [continued]: nanosleep()) = 0 1001.538 ( ): sched:sched_process_exit(comm=sleep pid=3323 prio=120) # - Make -v (verbose) mode be honoured for .perfconfig based trace.add_events, to help in diagnosing problems with building eBPF events (-e source.c). - When using eBPF syscall payload augmentation do not show strace-like syscalls when all the user specified was some tracepoint event, bringing the behaviour in line with that of when not using eBPF augmentation. Intel PT: exported-sql-viewer GUI: Adrian Hunter: - Add LookupModel, HBoxLayout, VBoxLayout, global time range calculations so as to add a time chart by CPU. perf script: Andi Kleen: - Allow --time (to specify a time span of interest) with --reltime perf diff: Jin Yao: - Report noise for cycles diff, i.e. a histogram + stddev. (timestamps relative to start). perf annotate: Arnaldo Carvalho de Melo: - Initialize env->cpuid when running in live mode (perf top), as it is used in some of the per arch annotation init routines. samples bpf: Björn Töpel: - Fixup fallout of using tools/perf/perf-sys. from outside tools/perf. Core: Ian Rogers: - Avoid 'sample_reg_masks' being const + weak, as this breaks with some compilers that constant-propagate from the weak symbol. libperf: - First part of moving the perf_mmap class from tools/perf to libperf. - Propagate CFLAGS to libperf from the tools/perf Makefile. Vendor events: John Garry: - Add entry in MAINTAINERS with reviewers for the for perf tool arm64 pmu-events files. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-15KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in useGreg Kurz
Connecting a vCPU to a XIVE KVM device means establishing a 1:1 association between a vCPU id and the offset (VP id) of a VP structure within a fixed size block of VPs. We currently try to enforce the 1:1 relationship by checking that a vCPU with the same id isn't already connected. This is good but unfortunately not enough because we don't map VP ids to raw vCPU ids but to packed vCPU ids, and the packing function kvmppc_pack_vcpu_id() isn't bijective by design. We got away with it because QEMU passes vCPU ids that fit well in the packing pattern. But nothing prevents userspace to come up with a forged vCPU id resulting in a packed id collision which causes the KVM device to associate two vCPUs to the same VP. This greatly confuses the irq layer and ultimately crashes the kernel, as shown below. Example: a guest with 1 guest thread per core, a core stride of 8 and 300 vCPUs has vCPU ids 0,8,16...2392. If QEMU is patched to inject at some point an invalid vCPU id 348, which is the packed version of itself and 2392, we get: genirq: Flags mismatch irq 199. 00010000 (kvm-2-2392) vs. 00010000 (kvm-2-348) CPU: 24 PID: 88176 Comm: qemu-system-ppc Not tainted 5.3.0-xive-nr-servers-5.3-gku+ #38 Call Trace: [c000003f7f9937e0] [c000000000c0110c] dump_stack+0xb0/0xf4 (unreliable) [c000003f7f993820] [c0000000001cb480] __setup_irq+0xa70/0xad0 [c000003f7f9938d0] [c0000000001cb75c] request_threaded_irq+0x13c/0x260 [c000003f7f993940] [c00800000d44e7ac] kvmppc_xive_attach_escalation+0x104/0x270 [kvm] [c000003f7f9939d0] [c00800000d45013c] kvmppc_xive_connect_vcpu+0x424/0x620 [kvm] [c000003f7f993ac0] [c00800000d444428] kvm_arch_vcpu_ioctl+0x260/0x448 [kvm] [c000003f7f993b90] [c00800000d43593c] kvm_vcpu_ioctl+0x154/0x7c8 [kvm] [c000003f7f993d00] [c0000000004840f0] do_vfs_ioctl+0xe0/0xc30 [c000003f7f993db0] [c000000000484d44] ksys_ioctl+0x104/0x120 [c000003f7f993e00] [c000000000484d88] sys_ioctl+0x28/0x80 [c000003f7f993e20] [c00000000000b278] system_call+0x5c/0x68 xive-kvm: Failed to request escalation interrupt for queue 0 of VCPU 2392 ------------[ cut here ]------------ remove_proc_entry: removing non-empty directory 'irq/199', leaking at least 'kvm-2-348' WARNING: CPU: 24 PID: 88176 at /home/greg/Work/linux/kernel-kvm-ppc/fs/proc/generic.c:684 remove_proc_entry+0x1ec/0x200 Modules linked in: kvm_hv kvm dm_mod vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter squashfs loop fuse i2c_dev sg ofpart ocxl powernv_flash at24 xts mtd uio_pdrv_genirq vmx_crypto opal_prd ipmi_powernv uio ipmi_devintf ipmi_msghandler ibmpowernv ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables ext4 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c raid1 raid0 linear sd_mod ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci libata tg3 drm_panel_orientation_quirks [last unloaded: kvm] CPU: 24 PID: 88176 Comm: qemu-system-ppc Not tainted 5.3.0-xive-nr-servers-5.3-gku+ #38 NIP: c00000000053b0cc LR: c00000000053b0c8 CTR: c0000000000ba3b0 REGS: c000003f7f9934b0 TRAP: 0700 Not tainted (5.3.0-xive-nr-servers-5.3-gku+) MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 48228222 XER: 20040000 CFAR: c000000000131a50 IRQMASK: 0 GPR00: c00000000053b0c8 c000003f7f993740 c0000000015ec500 0000000000000057 GPR04: 0000000000000001 0000000000000000 000049fb98484262 0000000000001bcf GPR08: 0000000000000007 0000000000000007 0000000000000001 9000000000001033 GPR12: 0000000000008000 c000003ffffeb800 0000000000000000 000000012f4ce5a1 GPR16: 000000012ef5a0c8 0000000000000000 000000012f113bb0 0000000000000000 GPR20: 000000012f45d918 c000003f863758b0 c000003f86375870 0000000000000006 GPR24: c000003f86375a30 0000000000000007 c0002039373d9020 c0000000014c4a48 GPR28: 0000000000000001 c000003fe62a4f6b c00020394b2e9fab c000003fe62a4ec0 NIP [c00000000053b0cc] remove_proc_entry+0x1ec/0x200 LR [c00000000053b0c8] remove_proc_entry+0x1e8/0x200 Call Trace: [c000003f7f993740] [c00000000053b0c8] remove_proc_entry+0x1e8/0x200 (unreliable) [c000003f7f9937e0] [c0000000001d3654] unregister_irq_proc+0x114/0x150 [c000003f7f993880] [c0000000001c6284] free_desc+0x54/0xb0 [c000003f7f9938c0] [c0000000001c65ec] irq_free_descs+0xac/0x100 [c000003f7f993910] [c0000000001d1ff8] irq_dispose_mapping+0x68/0x80 [c000003f7f993940] [c00800000d44e8a4] kvmppc_xive_attach_escalation+0x1fc/0x270 [kvm] [c000003f7f9939d0] [c00800000d45013c] kvmppc_xive_connect_vcpu+0x424/0x620 [kvm] [c000003f7f993ac0] [c00800000d444428] kvm_arch_vcpu_ioctl+0x260/0x448 [kvm] [c000003f7f993b90] [c00800000d43593c] kvm_vcpu_ioctl+0x154/0x7c8 [kvm] [c000003f7f993d00] [c0000000004840f0] do_vfs_ioctl+0xe0/0xc30 [c000003f7f993db0] [c000000000484d44] ksys_ioctl+0x104/0x120 [c000003f7f993e00] [c000000000484d88] sys_ioctl+0x28/0x80 [c000003f7f993e20] [c00000000000b278] system_call+0x5c/0x68 Instruction dump: 2c230000 41820008 3923ff78 e8e900a0 3c82ff69 3c62ff8d 7fa6eb78 7fc5f378 3884f080 3863b948 4bbf6925 60000000 <0fe00000> 4bffff7c fba10088 4bbf6e41 ---[ end trace b925b67a74a1d8d1 ]--- BUG: Kernel NULL pointer dereference at 0x00000010 Faulting instruction address: 0xc00800000d44fc04 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV Modules linked in: kvm_hv kvm dm_mod vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter squashfs loop fuse i2c_dev sg ofpart ocxl powernv_flash at24 xts mtd uio_pdrv_genirq vmx_crypto opal_prd ipmi_powernv uio ipmi_devintf ipmi_msghandler ibmpowernv ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables ext4 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c raid1 raid0 linear sd_mod ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci libata tg3 drm_panel_orientation_quirks [last unloaded: kvm] CPU: 24 PID: 88176 Comm: qemu-system-ppc Tainted: G W 5.3.0-xive-nr-servers-5.3-gku+ #38 NIP: c00800000d44fc04 LR: c00800000d44fc00 CTR: c0000000001cd970 REGS: c000003f7f9938e0 TRAP: 0300 Tainted: G W (5.3.0-xive-nr-servers-5.3-gku+) MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24228882 XER: 20040000 CFAR: c0000000001cd9ac DAR: 0000000000000010 DSISR: 40000000 IRQMASK: 0 GPR00: c00800000d44fc00 c000003f7f993b70 c00800000d468300 0000000000000000 GPR04: 00000000000000c7 0000000000000000 0000000000000000 c000003ffacd06d8 GPR08: 0000000000000000 c000003ffacd0738 0000000000000000 fffffffffffffffd GPR12: 0000000000000040 c000003ffffeb800 0000000000000000 000000012f4ce5a1 GPR16: 000000012ef5a0c8 0000000000000000 000000012f113bb0 0000000000000000 GPR20: 000000012f45d918 00007ffffe0d9a80 000000012f4f5df0 000000012ef8c9f8 GPR24: 0000000000000001 0000000000000000 c000003fe4501ed0 c000003f8b1d0000 GPR28: c0000033314689c0 c000003fe4501c00 c000003fe4501e70 c000003fe4501e90 NIP [c00800000d44fc04] kvmppc_xive_cleanup_vcpu+0xfc/0x210 [kvm] LR [c00800000d44fc00] kvmppc_xive_cleanup_vcpu+0xf8/0x210 [kvm] Call Trace: [c000003f7f993b70] [c00800000d44fc00] kvmppc_xive_cleanup_vcpu+0xf8/0x210 [kvm] (unreliable) [c000003f7f993bd0] [c00800000d450bd4] kvmppc_xive_release+0xdc/0x1b0 [kvm] [c000003f7f993c30] [c00800000d436a98] kvm_device_release+0xb0/0x110 [kvm] [c000003f7f993c70] [c00000000046730c] __fput+0xec/0x320 [c000003f7f993cd0] [c000000000164ae0] task_work_run+0x150/0x1c0 [c000003f7f993d30] [c000000000025034] do_notify_resume+0x304/0x440 [c000003f7f993e20] [c00000000000dcc4] ret_from_except_lite+0x70/0x74 Instruction dump: 3bff0008 7fbfd040 419e0054 847e0004 2fa30000 419effec e93d0000 8929203c 2f890000 419effb8 4800821d e8410018 <e9230010> e9490008 9b2a0039 7c0004ac ---[ end trace b925b67a74a1d8d2 ]--- Kernel panic - not syncing: Fatal exception This affects both XIVE and XICS-on-XIVE devices since the beginning. Check the VP id instead of the vCPU id when a new vCPU is connected. The allocation of the XIVE CPU structure in kvmppc_xive_connect_vcpu() is moved after the check to avoid the need for rollback. Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-10-14arm64: hibernate: check pgd table allocationPavel Tatashin
There is a bug in create_safe_exec_page(), when page table is allocated it is not checked that table is allocated successfully: But it is dereferenced in: pgd_none(READ_ONCE(*pgdp)). Check that allocation was successful. Fixes: 82869ac57b5d ("arm64: kernel: Add support for hibernate/suspend-to-disk") Reviewed-by: James Morse <james.morse@arm.com> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Will Deacon <will@kernel.org>
2019-10-14arm64: cpufeature: Treat ID_AA64ZFR0_EL1 as RAZ when SVE is not enabledJulien Grall
If CONFIG_ARM64_SVE=n then we fail to report ID_AA64ZFR0_EL1 as 0 when read by userspace, despite being required by the architecture. Although this is theoretically a change in ABI, userspace will first check for the presence of SVE via the HWCAP or the ID_AA64PFR0_EL1.SVE field before probing the ID_AA64ZFR0_EL1 register. Given that these are reported correctly for this configuration, we can safely tighten up the current behaviour. Ensure ID_AA64ZFR0_EL1 is treated as RAZ when CONFIG_ARM64_SVE=n. Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Dave Martin <dave.martin@arm.com> Fixes: 06a916feca2b ("arm64: Expose SVE2 features for userspace") Signed-off-by: Will Deacon <will@kernel.org>
2019-10-14Merge branch 'aquantia-fixes'David S. Miller
Igor Russkikh says: ==================== Aquantia/Marvell AQtion atlantic driver fixes 10/2019 Here is a set of various bugfixes, to be considered for stable as well. V2: double space removed ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-14net: aquantia: correctly handle macvlan and multicast coexistenceDmitry Bogdanov
macvlan and multicast handling is now mixed up. The explicit issue is that macvlan interface gets broken (no traffic) after clearing MULTICAST flag on the real interface. We now do separate logic and consider both ALLMULTI and MULTICAST flags on the device. Fixes: 11ba961c9161 ("net: aquantia: Fix IFF_ALLMULTI flag functionality") Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-14net: aquantia: do not pass lro session with invalid tcp checksumDmitry Bogdanov
Individual descriptors on LRO TCP session should be checked for CRC errors. It was discovered that HW recalculates L4 checksums on LRO session and does not break it up on bad L4 csum. Thus, driver should aggregate HW LRO L4 statuses from all individual buffers of LRO session and drop packet if one of the buffers has bad L4 checksum. Fixes: f38f1ee8aeb2 ("net: aquantia: check rx csum for all packets in LRO session") Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-14net: aquantia: when cleaning hw cache it should be toggledIgor Russkikh
>From HW specification to correctly reset HW caches (this is a required workaround when stopping the device), register bit should actually be toggled. It was previosly always just set. Due to the way driver stops HW this never actually caused any issues, but it still may, so cleaning this up. Fixes: 7a1bb49461b1 ("net: aquantia: fix potential IOMMU fault after driver unbind") Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-14net: aquantia: temperature retrieval fixIgor Russkikh
Chip temperature is a two byte word, colocated internally with cable length data. We do all readouts from HW memory by dwords, thus we should clear extra high bytes, otherwise temperature output gets weird as soon as we attach a cable to the NIC. Fixes: 8f8940118654 ("net: aquantia: add infrastructure to readout chip temperature") Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-14Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more fixes from Andrew Morton: "The usual shower of hotfixes and some followups to the recently merged page_owner enhancements" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped more than once mm/slab.c: fix kernel-doc warning for __ksize() xarray.h: fix kernel-doc warning bitmap.h: fix kernel-doc warning and typo fs/fs-writeback.c: fix kernel-doc warning fs/libfs.c: fix kernel-doc warning fs/direct-io.c: fix kernel-doc warning mm, compaction: fix wrong pfn handling in __reset_isolation_pfn() mm, hugetlb: allow hugepage allocations to reclaim as needed lib/test_meminit: add a kmem_cache_alloc_bulk() test mm/slub.c: init_on_free=1 should wipe freelist ptr for bulk allocations lib/generic-radix-tree.c: add kmemleak annotations mm/slub: fix a deadlock in show_slab_objects() mm, page_owner: rename flag indicating that page is allocated mm, page_owner: decouple freeing stack trace from debug_pagealloc mm, page_owner: fix off-by-one error in __set_page_owner_handle()
2019-10-15gpio: lynxpoint: set default handler to be handle_bad_irq()Andy Shevchenko
We switch the default handler to be handle_bad_irq() instead of handle_simple_irq() (which was not correct anyway). Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-10-15gpio: merrifield: Move hardware initialization to callbackAndy Shevchenko
The driver wants to initialize related registers before IRQ chip will be added. That's why move it to a corresponding callback. It also fixes the NULL pointer dereference. Fixes: 8f86a5b4ad67 ("gpio: merrifield: Pass irqchip when adding gpiochip") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-10-15gpio: lynxpoint: Move hardware initialization to callbackAndy Shevchenko
The driver wants to initialize related registers before IRQ chip will be added. That's why move it to a corresponding callback. It also fixes the NULL pointer dereference. Fixes: 7b1e889436a1 ("gpio: lynxpoint: Pass irqchip when adding gpiochip") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-10-15gpio: intel-mid: Move hardware initialization to callbackAndy Shevchenko
The driver wants to initialize related registers before IRQ chip will be added. That's why move it to a corresponding callback. It also fixes the NULL pointer dereference. Fixes: 8069e69a9792 ("gpio: intel-mid: Pass irqchip when adding gpiochip") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-10-15gpiolib: Initialize the hardware with a callbackAndy Shevchenko
After changing the drivers to use GPIO core to add an IRQ chip it appears that some of them requires a hardware initialization before adding the IRQ chip. Add an optional callback ->init_hw() to allow that drivers to initialize hardware if needed. This change is a part of the fix NULL pointer dereference brought to the several drivers recently. Cc: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-10-15gpio: merrifield: Restore use of irq_baseAndy Shevchenko
During conversion to internal IRQ chip initialization the commit 8f86a5b4ad67 ("gpio: merrifield: Pass irqchip when adding gpiochip") lost the irq_base assignment. drivers/gpio/gpio-merrifield.c: In function ‘mrfld_gpio_probe’: drivers/gpio/gpio-merrifield.c:405:17: warning: variable ‘irq_base’ set but not used [-Wunused-but-set-variable] Assign the girq->first to it. Fixes: 8f86a5b4ad67 ("gpio: merrifield: Pass irqchip when adding gpiochip") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-10-14xtensa: drop EXPORT_SYMBOL for outs*/ins*Max Filippov
Custom outs*/ins* implementations are long gone from the xtensa port, remove matching EXPORT_SYMBOLs. This fixes the following build warnings issued by modpost since commit 15bfc2348d54 ("modpost: check for static EXPORT_SYMBOL* functions"): WARNING: "insb" [vmlinux] is a static EXPORT_SYMBOL WARNING: "insw" [vmlinux] is a static EXPORT_SYMBOL WARNING: "insl" [vmlinux] is a static EXPORT_SYMBOL WARNING: "outsb" [vmlinux] is a static EXPORT_SYMBOL WARNING: "outsw" [vmlinux] is a static EXPORT_SYMBOL WARNING: "outsl" [vmlinux] is a static EXPORT_SYMBOL Cc: stable@vger.kernel.org Fixes: d38efc1f150f ("xtensa: adopt generic io routines") Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
2019-10-14mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped ↵Jane Chu
more than once Mmap /dev/dax more than once, then read the poison location using address from one of the mappings. The other mappings due to not having the page mapped in will cause SIGKILLs delivered to the process. SIGKILL succeeds over SIGBUS, so user process loses the opportunity to handle the UE. Although one may add MAP_POPULATE to mmap(2) to work around the issue, MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so isn't always an option. Details - ndctl inject-error --block=10 --count=1 namespace6.0 ./read_poison -x dax6.0 -o 5120 -m 2 mmaped address 0x7f5bb6600000 mmaped address 0x7f3cf3600000 doing local read at address 0x7f3cf3601400 Killed Console messages in instrumented kernel - mce: Uncorrected hardware memory error in user-access at edbe201400 Memory failure: tk->addr = 7f5bb6601000 Memory failure: address edbe201: call dev_pagemap_mapping_shift dev_pagemap_mapping_shift: page edbe201: no PUD Memory failure: tk->size_shift == 0 Memory failure: Unable to find user space address edbe201 in read_poison Memory failure: tk->addr = 7f3cf3601000 Memory failure: address edbe201: call dev_pagemap_mapping_shift Memory failure: tk->size_shift = 21 Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of failure to unmap corrupted page => to deliver SIGKILL Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory corruption => to deliver SIGBUS Link: http://lkml.kernel.org/r/1565112345-28754-3-git-send-email-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14mm/slab.c: fix kernel-doc warning for __ksize()Randy Dunlap
Fix kernel-doc warning in mm/slab.c: mm/slab.c:4215: warning: Function parameter or member 'objp' not described in '__ksize' Also add Return: documentation section for this function. Link: http://lkml.kernel.org/r/68c9fd7d-f09e-d376-e292-c7b2bdf1774d@infradead.org Fixes: 10d1f8cb3965 ("mm/slab: refactor common ksize KASAN logic into slab_common.c") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14xarray.h: fix kernel-doc warningRandy Dunlap
Fix (Sphinx) kernel-doc warning in <linux/xarray.h>: include/linux/xarray.h:232: WARNING: Unexpected indentation. Link: http://lkml.kernel.org/r/89ba2134-ce23-7c10-5ee1-ef83b35aa984@infradead.org Fixes: a3e4d3f97ec8 ("XArray: Redesign xa_alloc API") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14bitmap.h: fix kernel-doc warning and typoRandy Dunlap
Fix kernel-doc warning in <linux/bitmap.h>: include/linux/bitmap.h:341: warning: Function parameter or member 'nbits' not described in 'bitmap_or_equal' Also fix small typo (bitnaps). Link: http://lkml.kernel.org/r/0729ea7a-2c0d-b2c5-7dd3-3629ee0803e2@infradead.org Fixes: b9fa6442f704 ("cpumask: Implement cpumask_or_equal()") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14fs/fs-writeback.c: fix kernel-doc warningRandy Dunlap
Fix kernel-doc warning in fs/fs-writeback.c: fs/fs-writeback.c:913: warning: Excess function parameter 'nr_pages' description in 'cgroup_writeback_by_id' Link: http://lkml.kernel.org/r/756645ac-0ce8-d47e-d30a-04d9e4923a4f@infradead.org Fixes: d62241c7a406 ("writeback, memcg: Implement cgroup_writeback_by_id()") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14fs/libfs.c: fix kernel-doc warningRandy Dunlap
Fix kernel-doc warning in fs/libfs.c: fs/libfs.c:496: warning: Excess function parameter 'available' description in 'simple_write_end' Link: http://lkml.kernel.org/r/5fc9d70b-e377-0ec9-066a-970d49579041@infradead.org Fixes: ad2a722f196d ("libfs: Open code simple_commit_write into only user") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Boaz Harrosh <boazh@netapp.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14fs/direct-io.c: fix kernel-doc warningRandy Dunlap
Fix kernel-doc warning in fs/direct-io.c: fs/direct-io.c:258: warning: Excess function parameter 'offset' description in 'dio_complete' Also, don't mark this function as having kernel-doc notation since it is not exported. Link: http://lkml.kernel.org/r/97908511-4328-4a56-17fe-f43a1d7aa470@infradead.org Fixes: 6d544bb4d901 ("dio: centralize completion in dio_complete()") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Zach Brown <zab@zabbo.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14mm, compaction: fix wrong pfn handling in __reset_isolation_pfn()Vlastimil Babka
Florian and Dave reported [1] a NULL pointer dereference in __reset_isolation_pfn(). While the exact cause is unclear, staring at the code revealed two bugs, which might be related. One bug is that if zone starts in the middle of pageblock, block_page might correspond to different pfn than block_pfn, and then the pfn_valid_within() checks will check different pfn's than those accessed via struct page. This might result in acessing an unitialized page in CONFIG_HOLES_IN_ZONE configs. The other bug is that end_page refers to the first page of next pageblock and not last page of current pageblock. The online and valid check is then wrong and with sections, the while (page < end_page) loop might wander off actual struct page arrays. [1] https://lore.kernel.org/linux-xfs/87o8z1fvqu.fsf@mid.deneb.enyo.de/ Link: http://lkml.kernel.org/r/20191008152915.24704-1-vbabka@suse.cz Fixes: 6b0868c820ff ("mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Florian Weimer <fw@deneb.enyo.de> Reported-by: Dave Chinner <david@fromorbit.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14mm, hugetlb: allow hugepage allocations to reclaim as neededDavid Rientjes
Commit b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed") has chnaged the allocator to bail out from the allocator early to prevent from a potentially excessive memory reclaim. __GFP_RETRY_MAYFAIL is designed to retry the allocation, reclaim and compaction loop as long as there is a reasonable chance to make forward progress. Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at the INIT_COMPACT_PRIORITY compaction attempt gives this feedback. The most obvious affected subsystem is hugetlbfs which allocates huge pages based on an admin request (or via admin configured overcommit). I have done a simple test which tries to allocate half of the memory for hugetlb pages while the memory is full of a clean page cache. This is not an unusual situation because we try to cache as much of the memory as possible and sysctl/sysfs interface to allocate huge pages is there for flexibility to allocate hugetlb pages at any time. System has 1GB of RAM and we are requesting 515MB worth of hugetlb pages after the memory is prefilled by a clean page cache: root@test1:~# cat hugetlb_test.sh set -x echo 0 > /proc/sys/vm/nr_hugepages echo 3 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/compact_memory dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10)) TS=$(date +%s) echo 256 > /proc/sys/vm/nr_hugepages cat /proc/sys/vm/nr_hugepages The results for 2 consecutive runs on clean 5.3 root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s + date +%s + TS=1569905284 + echo 256 + cat /proc/sys/vm/nr_hugepages 256 root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s + date +%s + TS=1569905311 + echo 256 + cat /proc/sys/vm/nr_hugepages 256 Now with b39d0ee2632d applied root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s + date +%s + TS=1569905516 + echo 256 + cat /proc/sys/vm/nr_hugepages 11 root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s + date +%s + TS=1569905541 + echo 256 + cat /proc/sys/vm/nr_hugepages 12 The success rate went down by factor of 20! Although hugetlb allocation requests might fail and it is reasonable to expect them to under extremely fragmented memory or when the memory is under a heavy pressure but the above situation is not that case. Fix the regression by reverting back to the previous behavior for __GFP_RETRY_MAYFAIL requests and disable the beail out heuristic for those requests. Mike said: : hugetlbfs allocations are commonly done via sysctl/sysfs shortly after : boot where this may not be as much of an issue. However, I am aware of at : least three use cases where allocations are made after the system has been : up and running for quite some time: : : - DB reconfiguration. If sysctl/sysfs fails to get required number of : huge pages, system is rebooted to perform allocation after boot. : : - VM provisioning. If unable get required number of huge pages, fall : back to base pages. : : - An application that does not preallocate pool, but rather allocates : pages at fault time for optimal NUMA locality. : : In all cases, I would expect b39d0ee2632d to cause regressions and : noticable behavior changes. : : My quick/limited testing in : https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com : was insufficient. It was also mentioned that if something like : b39d0ee2632d went forward, I would like exemptions for __GFP_RETRY_MAYFAIL : requests as in this patch. [mhocko@suse.com: reworded changelog] Link: http://lkml.kernel.org/r/20191007075548.12456-1-mhocko@kernel.org Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed") Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14lib/test_meminit: add a kmem_cache_alloc_bulk() testAlexander Potapenko
Make sure allocations from kmem_cache_alloc_bulk() and kmem_cache_free_bulk() are properly initialized. Link: http://lkml.kernel.org/r/20191007091605.30530-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Christoph Lameter <cl@linux.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Thibaut Sautereau <thibaut@sautereau.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>