summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Linux 5.1drm-tda998x-fixesdrm-armada-fixesLinus Torvalds2019-05-051-1/+1
|
* Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2019-05-0518-32/+272
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "I'd like to apologize for this very late pull request: I was dithering through the week whether to send the fixes, and then yesterday Jiri's crash fix for a regression introduced in this cycle clearly marked perf/urgent as 'must merge now'. Most of the commits are tooling fixes, plus there's three kernel fixes via four commits: - race fix in the Intel PEBS code - fix an AUX bug and roll back a previous attempt - fix AMD family 17h generic HW cache-event perf counters The largest diffstat contribution comes from the AMD fix - a new event table is introduced, which is a fairly low risk change but has a large linecount" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Fix race in intel_pmu_disable_event() perf/x86/intel/pt: Remove software double buffering PMU capability perf/ring_buffer: Fix AUX software double buffering perf tools: Remove needless asm/unistd.h include fixing build in some places tools arch uapi: Copy missing unistd.h headers for arc, hexagon and riscv tools build: Add -ldl to the disassembler-four-args feature test perf cs-etm: Always allocate memory for cs_etm_queue::prev_packet perf cs-etm: Don't check cs_etm_queue::prev_packet validity perf report: Report OOM in status line in the GTK UI perf bench numa: Add define for RUSAGE_THREAD if not present tools lib traceevent: Change tag string for error perf annotate: Fix build on 32 bit for BPF annotation tools uapi x86: Sync vmx.h with the kernel perf bpf: Return value with unlocking in perf_env__find_btf() MAINTAINERS: Include vendor specific files under arch/*/events/* perf/x86/amd: Update generic hardware cache events for Family 17h
| * perf/x86/intel: Fix race in intel_pmu_disable_event()Jiri Olsa2019-05-051-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | New race in x86_pmu_stop() was introduced by replacing the atomic __test_and_clear_bit() of cpuc->active_mask by separate test_bit() and __clear_bit() calls in the following commit: 3966c3feca3f ("x86/perf/amd: Remove need to check "running" bit in NMI handler") The race causes panic for PEBS events with enabled callchains: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 ... RIP: 0010:perf_prepare_sample+0x8c/0x530 Call Trace: <NMI> perf_event_output_forward+0x2a/0x80 __perf_event_overflow+0x51/0xe0 handle_pmi_common+0x19e/0x240 intel_pmu_handle_irq+0xad/0x170 perf_event_nmi_handler+0x2e/0x50 nmi_handle+0x69/0x110 default_do_nmi+0x3e/0x100 do_nmi+0x11a/0x180 end_repeat_nmi+0x16/0x1a RIP: 0010:native_write_msr+0x6/0x20 ... </NMI> intel_pmu_disable_event+0x98/0xf0 x86_pmu_stop+0x6e/0xb0 x86_pmu_del+0x46/0x140 event_sched_out.isra.97+0x7e/0x160 ... The event is configured to make samples from PEBS drain code, but when it's disabled, we'll go through NMI path instead, where data->callchain will not get allocated and we'll crash: x86_pmu_stop test_bit(hwc->idx, cpuc->active_mask) intel_pmu_disable_event(event) { ... intel_pmu_pebs_disable(event); ... EVENT OVERFLOW -> <NMI> intel_pmu_handle_irq handle_pmi_common TEST PASSES -> test_bit(bit, cpuc->active_mask)) perf_event_overflow perf_prepare_sample { ... if (!(sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY)) data->callchain = perf_callchain(event, regs); CRASH -> size += data->callchain->nr; } </NMI> ... x86_pmu_disable_event(event) } __clear_bit(hwc->idx, cpuc->active_mask); Fixing this by disabling the event itself before setting off the PEBS bit. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Arcari <darcari@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Lendacky Thomas <Thomas.Lendacky@amd.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 3966c3feca3f ("x86/perf/amd: Remove need to check "running" bit in NMI handler") Link: http://lkml.kernel.org/r/20190504151556.31031-1-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/intel/pt: Remove software double buffering PMU capabilityAlexander Shishkin2019-05-032-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that all AUX allocations are high-order by default, the software double buffering PMU capability doesn't make sense any more, get rid of it. In case some PMUs choose to opt out, we can re-introduce it. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: adrian.hunter@intel.com Link: http://lkml.kernel.org/r/20190503085536.24119-3-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/ring_buffer: Fix AUX software double bufferingAlexander Shishkin2019-05-031-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This recent commit: 5768402fd9c6e87 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically") overlooked the fact that the previous one page granularity of the AUX buffer provided an implicit double buffering capability to the PMU driver, which went away when the entire buffer became one high-order page. Always make the full-trace mode AUX allocation at least two-part to preserve the previous behavior and allow the implicit double buffering to continue. Reported-by: Ammy Yi <ammy.yi@intel.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: adrian.hunter@intel.com Fixes: 5768402fd9c6e87 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically") Link: http://lkml.kernel.org/r/20190503085536.24119-2-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * Merge tag 'perf-urgent-for-mingo-5.1-20190502' of ↵Ingo Molnar2019-05-0312-21/+154
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent Pull perf/urgent fixes from Arnaldo Carvalho de Melo: tools UAPI: Arnaldo Carvalho de Melo: - Sync x86's vmx.h with the kernel. - Copy missing unistd.h headers for arc, hexagon and riscv, fixing a reported build regression on the ARC 32-bit architecture. perf bench numa: Arnaldo Carvalho de Melo: - Add define for RUSAGE_THREAD if not present, fixing the build on the ARC architecture when only zlib and libnuma are present. perf BPF: Arnaldo Carvalho de Melo: - The disassembler-four-args feature test needs -ldl on distros such as Mageia 7. Bo YU: - Fix unlocking on success in perf_env__find_btf(), detected with the coverity tool. libtraceevent: Leo Yan: - Change misleading hard coded 'trace-cmd' string in error messages. ARM hardware tracing: Leo Yan: - Always allocate memory for cs_etm_queue::prev_packet, fixing a segfault when processing CoreSight perf data. perf annotate: Thadeu Lima de Souza Cascardo: - Fix build on 32 bit for BPF. perf report: Thomas Richter: - Report OOM in status line in the GTK UI. core libs: - Remove needless asm/unistd.h that, used with sys/syscall.h ended up redefining the syscalls defines in environments such as the ARC arch when using uClibc. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * perf tools: Remove needless asm/unistd.h include fixing build in some placesArnaldo Carvalho de Melo2019-05-021-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We were including sys/syscall.h and asm/unistd.h, since sys/syscall.h includes asm/unistd.h, sometimes this leads to the redefinition of defines, breaking the build. Noticed on ARC with uCLibc. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Rich Felker <dalias@libc.org> Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com> Link: https://lkml.kernel.org/n/tip-xjpf80o64i2ko74aj2jih0qg@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * tools arch uapi: Copy missing unistd.h headers for arc, hexagon and riscvArnaldo Carvalho de Melo2019-05-023-0/+133
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since those were introduced in: c8ce48f06503 ("asm-generic: Make time32 syscall numbers optional") But when the asm-generic/unistd.h was sync'ed with tools/ in: 1a787fc5ba18 ("tools headers uapi: Sync copy of asm-generic/unistd.h with the kernel sources") I forgot to copy the files for the architectures that define __ARCH_WANT_TIME32_SYSCALLS, so the perf build was breaking there, as reported by Vineet Gupta for the ARC architecture. After updating my ARC container to use the glibc based toolchain + cross building libnuma, zlib and elfutils, I finally managed to reproduce the problem and verify that this now is fixed and will not regress as will be tested before each pull req sent upstream. Reported-by: Vineet Gupta <Vineet.Gupta1@synopsys.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jiri Olsa <jolsa@kernel.org> CC: linux-snps-arc@lists.infradead.org Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/r/20190426193531.GC28586@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * tools build: Add -ldl to the disassembler-four-args feature testArnaldo Carvalho de Melo2019-05-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Thomas Backlund reported that the perf build was failing on the Mageia 7 distro, that is because it uses: cat /tmp/build/perf/feature/test-disassembler-four-args.make.output /usr/bin/ld: /usr/lib64/libbfd.a(plugin.o): in function `try_load_plugin': /home/iurt/rpmbuild/BUILD/binutils-2.32/objs/bfd/../../bfd/plugin.c:243: undefined reference to `dlopen' /usr/bin/ld: /home/iurt/rpmbuild/BUILD/binutils-2.32/objs/bfd/../../bfd/plugin.c:271: undefined reference to `dlsym' /usr/bin/ld: /home/iurt/rpmbuild/BUILD/binutils-2.32/objs/bfd/../../bfd/plugin.c:256: undefined reference to `dlclose' /usr/bin/ld: /home/iurt/rpmbuild/BUILD/binutils-2.32/objs/bfd/../../bfd/plugin.c:246: undefined reference to `dlerror' as we allow dynamic linking and loading Mageia 7 uses these linker flags: $ rpm --eval %ldflags  -Wl,--as-needed -Wl,--no-undefined -Wl,-z,relro -Wl,-O1 -Wl,--build-id -Wl,--enable-new-dtags So add -ldl to this feature LDFLAGS. Reported-by: Thomas Backlund <tmb@mageia.org> Tested-by: Thomas Backlund <tmb@mageia.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Song Liu <songliubraving@fb.com> Link: https://lkml.kernel.org/r/20190501173158.GC21436@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * perf cs-etm: Always allocate memory for cs_etm_queue::prev_packetLeo Yan2019-05-021-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Robert Walker reported a segmentation fault is observed when process CoreSight trace data; this issue can be easily reproduced by the command 'perf report --itrace=i1000i' for decoding tracing data. If neither the 'b' flag (synthesize branches events) nor 'l' flag (synthesize last branch entries) are specified to option '--itrace', cs_etm_queue::prev_packet will not been initialised. After merging the code to support exception packets and sample flags, there introduced a number of uses of cs_etm_queue::prev_packet without checking whether it is valid, for these cases any accessing to uninitialised prev_packet will cause crash. As cs_etm_queue::prev_packet is used more widely now and it's already hard to follow which functions have been called in a context where the validity of cs_etm_queue::prev_packet has been checked, this patch always allocates memory for cs_etm_queue::prev_packet. Reported-by: Robert Walker <robert.walker@arm.com> Suggested-by: Robert Walker <robert.walker@arm.com> Signed-off-by: Leo Yan <leo.yan@linaro.org> Tested-by: Robert Walker <robert.walker@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Leach <mike.leach@linaro.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Suzuki K Poulouse <suzuki.poulose@arm.com> Cc: linux-arm-kernel@lists.infradead.org Fixes: 7100b12cf474 ("perf cs-etm: Generate branch sample for exception packet") Fixes: 24fff5eb2b93 ("perf cs-etm: Avoid stale branch samples when flush packet") Link: http://lkml.kernel.org/r/20190428083228.20246-1-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * perf cs-etm: Don't check cs_etm_queue::prev_packet validityLeo Yan2019-05-021-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since cs_etm_queue::prev_packet is allocated for all cases, it will never be NULL pointer; now validity checking prev_packet is pointless, remove all of them. Signed-off-by: Leo Yan <leo.yan@linaro.org> Tested-by: Robert Walker <robert.walker@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Leach <mike.leach@linaro.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Suzuki K Poulouse <suzuki.poulose@arm.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/20190428083228.20246-2-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * perf report: Report OOM in status line in the GTK UIThomas Richter2019-05-021-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An -ENOMEM error is not reported in the GTK GUI. Instead this error message pops up on the screen: [root@m35lp76 perf]# ./perf report -i perf.data.error68-1 Processing events... [974K/3M] Error:failed to process sample 0xf4198 [0x8]: failed to process type: 68 However when I use the same perf.data file with --stdio it works: [root@m35lp76 perf]# ./perf report -i perf.data.error68-1 --stdio \ | head -12 # Total Lost Samples: 0 # # Samples: 76K of event 'cycles' # Event count (approx.): 99056160000 # # Overhead Command Shared Object Symbol # ........ ............... ................. ......... # 8.81% find [kernel.kallsyms] [k] ftrace_likely_update 8.74% swapper [kernel.kallsyms] [k] ftrace_likely_update 8.34% sshd [kernel.kallsyms] [k] ftrace_likely_update 2.19% kworker/u512:1- [kernel.kallsyms] [k] ftrace_likely_update The sample precentage is a bit low..... The GUI always fails in the FINISHED_ROUND event (68) and does not indicate the reason why. When happened is the following. Perf report calls a lot of functions and down deep when a FINISHED_ROUND event is processed, these functions are called: perf_session__process_event() + perf_session__process_user_event() + process_finished_round() + ordered_events__flush() + __ordered_events__flush() + do_flush() + ordered_events__deliver_event() + perf_session__deliver_event() + machine__deliver_event() + perf_evlist__deliver_event() + process_sample_event() + hist_entry_iter_add() --> only called in GUI case!!! + hist_iter__report__callback() + symbol__inc_addr_sample() Now this functions runs out of memory and returns -ENOMEM. This is reported all the way up until function perf_session__process_event() returns to its caller, where -ENOMEM is changed to -EINVAL and processing stops: if ((skip = perf_session__process_event(session, event, head)) < 0) { pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n", head, event->header.size, event->header.type); err = -EINVAL; goto out_err; } This occurred in the FINISHED_ROUND event when it has to process some 10000 entries and ran out of memory. This patch indicates the root cause and displays it in the status line of ther perf report GUI. Output before (on GUI status line): 0xf4198 [0x8]: failed to process type: 68 Output after: 0xf4198 [0x8]: failed to process type: 68 [not enough memory] Committer notes: the 'skip' variable needs to be initialized to -EINVAL, so that when the size is less than sizeof(struct perf_event_attr) we avoid this valid compiler warning: util/session.c: In function ‘perf_session__process_events’: util/session.c:1936:7: error: ‘skip’ may be used uninitialized in this function [-Werror=maybe-uninitialized] err = skip; ~~~~^~~~~~ util/session.c:1874:6: note: ‘skip’ was declared here s64 skip; ^~~~ cc1: all warnings being treated as errors Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Reviewed-by: Jiri Olsa <jolsa@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Link: http://lkml.kernel.org/r/20190423105303.61683-1-tmricht@linux.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * perf bench numa: Add define for RUSAGE_THREAD if not presentArnaldo Carvalho de Melo2019-05-021-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While cross building perf to the ARC architecture on a fedora 30 host, we were failing with: CC /tmp/build/perf/bench/numa.o bench/numa.c: In function ‘worker_thread’: bench/numa.c:1261:12: error: ‘RUSAGE_THREAD’ undeclared (first use in this function); did you mean ‘SIGEV_THREAD’? getrusage(RUSAGE_THREAD, &rusage); ^~~~~~~~~~~~~ SIGEV_THREAD bench/numa.c:1261:12: note: each undeclared identifier is reported only once for each function it appears in [perfbuilder@60d5802468f6 perf]$ /arc_gnu_2019.03-rc1_prebuilt_uclibc_le_archs_linux_install/bin/arc-linux-gcc --version | head -1 arc-linux-gcc (ARCv2 ISA Linux uClibc toolchain 2019.03-rc1) 8.3.1 20190225 [perfbuilder@60d5802468f6 perf]$ Trying to reproduce a report by Vineet, I noticed that, with just cross-built zlib and numactl libraries, I ended up with the above failure. So, since RUSAGE_THREAD is available as a define, check for that and numactl libraries, I ended up with the above failure. So, since RUSAGE_THREAD is available as a define in the system headers, check if it is defined in the 'perf bench numa' sources and define it if not. Now it builds and I have to figure out if the problem reported by Vineet only takes place if we have libelf or some other library available. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jiri Olsa <jolsa@kernel.org> Cc: linux-snps-arc@lists.infradead.org Cc: Namhyung Kim <namhyung@kernel.org> Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com> Link: https://lkml.kernel.org/n/tip-2wb4r1gir9xrevbpq7qp0amk@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * tools lib traceevent: Change tag string for errorLeo Yan2019-05-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The traceevent lib is used by the perf tool, and when executing perf test -v 6 it outputs error log on the ARM64 platform: running test 33 '*:*'trace-cmd: No such file or directory [...] trace-cmd: Invalid argument The trace event parsing code originally came from trace-cmd so it keeps the tag string "trace-cmd" for errors, this easily introduces the impression that the perf tool launches trace-cmd command for trace event parsing, but in fact the related parsing is accomplished by the traceevent lib. This patch changes the tag string to "libtraceevent" so that we can avoid confusion and let users to more easily connect the error with traceevent lib. Signed-off-by: Leo Yan <leo.yan@linaro.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20190424013802.27569-1-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * perf annotate: Fix build on 32 bit for BPF annotationThadeu Lima de Souza Cascardo2019-05-021-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 6987561c9e86 ("perf annotate: Enable annotation of BPF programs") adds support for BPF programs annotations but the new code does not build on 32-bit. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Acked-by: Song Liu <songliubraving@fb.com> Fixes: 6987561c9e86 ("perf annotate: Enable annotation of BPF programs") Link: http://lkml.kernel.org/r/20190403194452.10845-1-cascardo@canonical.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * tools uapi x86: Sync vmx.h with the kernelArnaldo Carvalho de Melo2019-05-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To pick up the changes from: 2b27924bb1d4 ("KVM: nVMX: always use early vmcs check when EPT is disabled") That causes this object in the tools/perf build process to be rebuilt: CC /tmp/build/perf/arch/x86/util/kvm-stat.o But it isn't using VMX_ABORT_ prefixed constants, so no change in behaviour. This silences this perf build warning: Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/vmx.h' differs from latest version at 'arch/x86/include/uapi/asm/vmx.h' diff -u tools/arch/x86/include/uapi/asm/vmx.h arch/x86/include/uapi/asm/vmx.h Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Link: https://lkml.kernel.org/n/tip-bjbo3zc0r8i8oa0udpvftya6@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * perf bpf: Return value with unlocking in perf_env__find_btf()Bo YU2019-05-021-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In perf_env__find_btf(), we're returning without unlocking "env->bpf_progs.lock". There may be cause lockdep issue. Detected by CoversityScan, CID# 1444762:(program hangs(LOCK)) Signed-off-by: Bo YU <tsu.yubo@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: bpf@vger.kernel.org Cc: netdev@vger.kernel.org Fixes: 2db7b1e0bd49d: (perf bpf: Return NULL when RB tree lookup fails in perf_env__find_btf()) Link: http://lkml.kernel.org/r/20190422080138.10088-1-tsu.yubo@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| * MAINTAINERS: Include vendor specific files under arch/*/events/*Kim Phillips2019-05-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an explicit subdirectory specification for arch/x86/events/amd to the MAINTAINERS file, to distinguish it from its parent. This will produce the correct set of maintainers for the files found therein. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Gary Hook <Gary.Hook@amd.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Liška <mliska@suse.cz> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pu Wen <puwen@hygon.cn> Cc: Stephane Eranian <eranian@google.com> Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Lendacky <Thomas.Lendacky@amd.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Fixes: 39b0332a2158 ("perf/x86: Move perf_event_amd.c ........... => x86/events/amd/core.c") Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * perf/x86/amd: Update generic hardware cache events for Family 17hKim Phillips2019-05-021-3/+108
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new amd_hw_cache_event_ids_f17h assignment structure set for AMD families 17h and above, since a lot has changed. Specifically: L1 Data Cache The data cache access counter remains the same on Family 17h. For DC misses, PMCx041's definition changes with Family 17h, so instead we use the L2 cache accesses from L1 data cache misses counter (PMCx060,umask=0xc8). For DC hardware prefetch events, Family 17h breaks compatibility for PMCx067 "Data Prefetcher", so instead, we use PMCx05a "Hardware Prefetch DC Fills." L1 Instruction Cache PMCs 0x80 and 0x81 (32-byte IC fetches and misses) are backward compatible on Family 17h. For prefetches, we remove the erroneous PMCx04B assignment which counts how many software data cache prefetch load instructions were dispatched. LL - Last Level Cache Removing PMCs 7D, 7E, and 7F assignments, as they do not exist on Family 17h, where the last level cache is L3. L3 counters can be accessed using the existing AMD Uncore driver. Data TLB On Intel machines, data TLB accesses ("dTLB-loads") are assigned to counters that count load/store instructions retired. This is inconsistent with instruction TLB accesses, where Intel implementations report iTLB misses that hit in the STLB. Ideally, dTLB-loads would count higher level dTLB misses that hit in lower level TLBs, and dTLB-load-misses would report those that also missed in those lower-level TLBs, therefore causing a page table walk. That would be consistent with instruction TLB operation, remove the redundancy between dTLB-loads and L1-dcache-loads, and prevent perf from producing artificially low percentage ratios, i.e. the "0.01%" below: 42,550,869 L1-dcache-loads 41,591,860 dTLB-loads 4,802 dTLB-load-misses # 0.01% of all dTLB cache hits 7,283,682 L1-dcache-stores 7,912,392 dTLB-stores 310 dTLB-store-misses On AMD Families prior to 17h, the "Data Cache Accesses" counter is used, which is slightly better than load/store instructions retired, but still counts in terms of individual load/store operations instead of TLB operations. So, for AMD Families 17h and higher, this patch assigns "dTLB-loads" to a counter for L1 dTLB misses that hit in the L2 dTLB, and "dTLB-load-misses" to a counter for L1 DTLB misses that caused L2 DTLB misses and therefore also caused page table walks. This results in a much more accurate view of data TLB performance: 60,961,781 L1-dcache-loads 4,601 dTLB-loads 963 dTLB-load-misses # 20.93% of all dTLB cache hits Note that for all AMD families, data loads and stores are combined in a single accesses counter, so no 'L1-dcache-stores' are reported separately, and stores are counted with loads in 'L1-dcache-loads'. Also note that the "% of all dTLB cache hits" string is misleading because (a) "dTLB cache": although TLBs can be considered caches for page tables, in this context, it can be misinterpreted as data cache hits because the figures are similar (at least on Intel), and (b) not all those loads (technically accesses) technically "hit" at that hardware level. "% of all dTLB accesses" would be more clear/accurate. Instruction TLB On Intel machines, 'iTLB-loads' measure iTLB misses that hit in the STLB, and 'iTLB-load-misses' measure iTLB misses that also missed in the STLB and completed a page table walk. For AMD Family 17h and above, for 'iTLB-loads' we replace the erroneous instruction cache fetches counter with PMCx084 "L1 ITLB Miss, L2 ITLB Hit". For 'iTLB-load-misses' we still use PMCx085 "L1 ITLB Miss, L2 ITLB Miss", but set a 0xff umask because without it the event does not get counted. Branch Predictor (BPU) PMCs 0xc2 and 0xc3 continue to be valid across all AMD Families. Node Level Events Family 17h does not have a PMCx0e9 counter, and corresponding counters have not been made available publicly, so for now, we mark them as unsupported for Families 17h and above. Reference: "Open-Source Register Reference For AMD Family 17h Processors Models 00h-2Fh" Released 7/17/2018, Publication #56255, Revision 3.03: https://www.amd.com/system/files/TechDocs/56255_OSRR.pdf [ mingo: tidied up the line breaks. ] Signed-off-by: Kim Phillips <kim.phillips@amd.com> Cc: <stable@vger.kernel.org> # v4.9+ Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Liška <mliska@suse.cz> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pu Wen <puwen@hygon.cn> Cc: Stephane Eranian <eranian@google.com> Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Lendacky <Thomas.Lendacky@amd.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Cc: linux-perf-users@vger.kernel.org Fixes: e40ed1542dd7 ("perf/x86: Add perf support for AMD family-17h processors") Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds2019-05-051-0/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a kobject memory leak in the cpufreq code" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/cpufreq: Fix kobject memleak
| * | sched/cpufreq: Fix kobject memleakTobin C. Harding2019-04-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the error return path from kobject_init_and_add() is not followed by a call to kobject_put() - which means we are leaking the kobject. Fix it by adding a call to kobject_put() in the error path of kobject_init_and_add(). Signed-off-by: Tobin C. Harding <tobin@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/20190430001144.24890-1-tobin@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2019-05-052-0/+23
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Ingo Molnar: "Disable function tracing during early SME setup to fix a boot crash on SME-enabled kernels running distro kernels (some of which have function tracing enabled)" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/mem_encrypt: Disable all instrumentation for early SME setup
| * | | x86/mm/mem_encrypt: Disable all instrumentation for early SME setupGary Hook2019-04-302-0/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enablement of AMD's Secure Memory Encryption feature is determined very early after start_kernel() is entered. Part of this procedure involves scanning the command line for the parameter 'mem_encrypt'. To determine intended state, the function sme_enable() uses library functions cmdline_find_option() and strncmp(). Their use occurs early enough such that it cannot be assumed that any instrumentation subsystem is initialized. For example, making calls to a KASAN-instrumented function before KASAN is set up will result in the use of uninitialized memory and a boot failure. When AMD's SME support is enabled, conditionally disable instrumentation of these dependent functions in lib/string.c and arch/x86/lib/cmdline.c. [ bp: Get rid of intermediary nostackp var and cleanup whitespace. ] Fixes: aca20d546214 ("x86/mm: Add support to make use of Secure Memory Encryption") Reported-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Gary R Hook <gary.hook@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Boris Brezillon <bbrezillon@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: "dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: "luto@kernel.org" <luto@kernel.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "mingo@redhat.com" <mingo@redhat.com> Cc: "peterz@infradead.org" <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/155657657552.7116.18363762932464011367.stgit@sosrh3.amd.com
* | | | Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2019-05-055-16/+26
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull vfs fixes from Al Viro: - a couple of ->i_link use-after-free fixes - regression fix for wrong errno on absent device name in mount(2) (this cycle stuff) - ancient UFS braino in large GID handling on Solaris UFS images (bogus cut'n'paste from large UID handling; wrong field checked to decide whether we should look at old (16bit) or new (32bit) field) * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ufs: fix braino in ufs_get_inode_gid() for solaris UFS flavour Abort file_remove_privs() for non-reg. files [fix] get rid of checking for absent device name in vfs_get_tree() apparmorfs: fix use-after-free on symlink traversal securityfs: fix use-after-free on symlink traversal
| * | | | ufs: fix braino in ufs_get_inode_gid() for solaris UFS flavourAl Viro2019-05-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To choose whether to pick the GID from the old (16bit) or new (32bit) field, we should check if the old gid field is set to 0xffff. Mainline checks the old *UID* field instead - cut'n'paste from the corresponding code in ufs_get_inode_uid(). Fixes: 252e211e90ce Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | Abort file_remove_privs() for non-reg. filesAlexander Lochmann2019-04-281-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | file_remove_privs() might be called for non-regular files, e.g. blkdev inode. There is no reason to do its job on things like blkdev inodes, pipes, or cdevs. Hence, abort if file does not refer to a regular inode. AV: more to the point, for devices there might be any number of inodes refering to given device. Which one to strip the permissions from, even if that made any sense in the first place? All of them will be observed with contents modified, after all. Found by LockDoc (Alexander Lochmann, Horst Schirmeier and Olaf Spinczyk) Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Alexander Lochmann <alexander.lochmann@tu-dortmund.de> Signed-off-by: Horst Schirmeier <horst.schirmeier@tu-dortmund.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | [fix] get rid of checking for absent device name in vfs_get_tree()Al Viro2019-04-281-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It has no business being there, it's checked by relevant ->get_tree() as it is *and* it returns the wrong error for no reason whatsoever. Fixes: f3a09c92018a "introduce fs_context methods" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | apparmorfs: fix use-after-free on symlink traversalAl Viro2019-04-101-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | symlink body shouldn't be freed without an RCU delay. Switch apparmorfs to ->destroy_inode() and use of call_rcu(); free both the inode and symlink body in the callback. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | securityfs: fix use-after-free on symlink traversalAl Viro2019-04-101-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | symlink body shouldn't be freed without an RCU delay. Switch securityfs to ->destroy_inode() and use of call_rcu(); free both the inode and symlink body in the callback. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | | Merge tag 'powerpc-5.1-7' of ↵Linus Torvalds2019-05-041-4/+14
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fix from Michael Ellerman: "One regression fix. Changes we merged to STRICT_KERNEL_RWX on 32-bit were causing crashes under load on some machines depending on memory layout. Thanks to Christophe Leroy" * tag 'powerpc-5.1-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/32s: Fix BATs setting with CONFIG_STRICT_KERNEL_RWX
| * | | | | powerpc/32s: Fix BATs setting with CONFIG_STRICT_KERNEL_RWXChristophe Leroy2019-05-021-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Serge reported some crashes with CONFIG_STRICT_KERNEL_RWX enabled on a book3s32 machine. Analysis shows two issues: - BATs addresses and sizes are not properly aligned. - There is a gap between the last address covered by BATs and the first address covered by pages. Memory mapped with DBATs: 0: 0xc0000000-0xc07fffff 0x00000000 Kernel RO coherent 1: 0xc0800000-0xc0bfffff 0x00800000 Kernel RO coherent 2: 0xc0c00000-0xc13fffff 0x00c00000 Kernel RW coherent 3: 0xc1400000-0xc23fffff 0x01400000 Kernel RW coherent 4: 0xc2400000-0xc43fffff 0x02400000 Kernel RW coherent 5: 0xc4400000-0xc83fffff 0x04400000 Kernel RW coherent 6: 0xc8400000-0xd03fffff 0x08400000 Kernel RW coherent 7: 0xd0400000-0xe03fffff 0x10400000 Kernel RW coherent Memory mapped with pages: 0xe1000000-0xefffffff 0x21000000 240M rw present dirty accessed This patch fixes both issues. With the patch, we get the following which is as expected: Memory mapped with DBATs: 0: 0xc0000000-0xc07fffff 0x00000000 Kernel RO coherent 1: 0xc0800000-0xc0bfffff 0x00800000 Kernel RO coherent 2: 0xc0c00000-0xc0ffffff 0x00c00000 Kernel RW coherent 3: 0xc1000000-0xc1ffffff 0x01000000 Kernel RW coherent 4: 0xc2000000-0xc3ffffff 0x02000000 Kernel RW coherent 5: 0xc4000000-0xc7ffffff 0x04000000 Kernel RW coherent 6: 0xc8000000-0xcfffffff 0x08000000 Kernel RW coherent 7: 0xd0000000-0xdfffffff 0x10000000 Kernel RW coherent Memory mapped with pages: 0xe0000000-0xefffffff 0x20000000 256M rw present dirty accessed Fixes: 63b2bc619565 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX") Reported-by: Serge Belyshev <belyshev@depni.sinp.msu.ru> Acked-by: Segher Boessenkool <segher@kernel.crashing.org> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
* | | | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2019-05-0323-65/+192
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Paolo Bonzini: - PPC and ARM bugfixes from submaintainers - Fix old Windows versions on AMD (recent regression) - Fix old Linux versions on processors without EPT - Fixes for LAPIC timer optimizations * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits) KVM: nVMX: Fix size checks in vmx_set_nested_state KVM: selftests: make hyperv_cpuid test pass on AMD KVM: lapic: Check for in-kernel LAPIC before deferencing apic pointer KVM: fix KVM_CLEAR_DIRTY_LOG for memory slots of unaligned size x86/kvm/mmu: reset MMU context when 32-bit guest switches PAE KVM: x86: Whitelist port 0x7e for pre-incrementing %rip Documentation: kvm: fix dirty log ioctl arch lists KVM: VMX: Move RSB stuffing to before the first RET after VM-Exit KVM: arm/arm64: Don't emulate virtual timers on userspace ioctls kvm: arm: Skip stage2 huge mappings for unaligned ipa backed by THP KVM: arm/arm64: Ensure vcpu target is unset on reset failure KVM: lapic: Convert guest TSC to host time domain if necessary KVM: lapic: Allow user to disable adaptive tuning of timer advancement KVM: lapic: Track lapic timer advance per vCPU KVM: lapic: Disable timer advancement if adaptive tuning goes haywire x86: kvm: hyper-v: deal with buggy TLB flush requests from WS2012 KVM: x86: Consider LAPIC TSC-Deadline timer expired if deadline too short KVM: PPC: Book3S: Protect memslots while validating user address KVM: PPC: Book3S HV: Perserve PSSCR FAKE_SUSPEND bit on guest exit KVM: arm/arm64: vgic-v3: Retire pending interrupts on disabling LPIs ...
| * | | | | | KVM: nVMX: Fix size checks in vmx_set_nested_stateJim Mattson2019-05-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The size checks in vmx_nested_state are wrong because the calculations are made based on the size of a pointer to a struct kvm_nested_state rather than the size of a struct kvm_nested_state. Reported-by: Felix Wilhelm <fwilhelm@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Drew Schmitt <dasch@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Fixes: 8fcc4b5923af5de58b80b53a069453b135693304 Cc: stable@ver.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | Merge tag 'kvmarm-fixes-for-5.1-2' of ↵Paolo Bonzini2019-04-306-11/+48
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master KVM/ARM fixes for 5.1, take #2: - Don't try to emulate timers on userspace access - Fix unaligned huge mappings, again - Properly reset a vcpu that fails to reset(!) - Properly retire pending LPIs on reset - Fix computation of emulated CNTP_TVAL
| | * | | | | | KVM: arm/arm64: Don't emulate virtual timers on userspace ioctlsChristoffer Dall2019-04-251-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a VCPU never runs before a guest exists, but we set timer registers up via ioctls, the associated hrtimer might never get cancelled. Since we moved vcpu_load/put into the arch-specific implementations and only have load/put for KVM_RUN, we won't ever have a scheduled hrtimer for emulating a timer when modifying the timer state via an ioctl from user space. All we need to do is make sure that we pick up the right state when we load the timer state next time userspace calls KVM_RUN again. We also do not need to worry about this interacting with the bg_timer, because if we were in WFI from the guest, and somehow ended up in a kvm_arm_timer_set_reg, it means that: 1. the VCPU thread has received a signal, 2. we have called vcpu_load when being scheduled in again, 3. we have called vcpu_put when we returned to userspace for it to issue another ioctl And therefore will not have a bg_timer programmed and the event is treated as a spurious wakeup from WFI if userspace decides to run the vcpu again even if there are not virtual interrupts. This fixes stray virtual timer interrupts triggered by an expiring hrtimer, which happens after a failed live migration, for instance. Fixes: bee038a674875 ("KVM: arm/arm64: Rework the timer code to use a timer_map") Signed-off-by: Christoffer Dall <christoffer.dall@arm.com> Reported-by: Andre Przywara <andre.przywara@arm.com> Tested-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | | | | | kvm: arm: Skip stage2 huge mappings for unaligned ipa backed by THPSuzuki K Poulose2019-04-251-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With commit a80868f398554842b14, we no longer ensure that the THP page is properly aligned in the guest IPA. Skip the stage2 huge mapping for unaligned IPA backed by transparent hugepages. Fixes: a80868f398554842b14 ("KVM: arm/arm64: Enforce PTE mappings at stage2 when needed") Reported-by: Eric Auger <eric.auger@redhat.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Chirstoffer Dall <christoffer.dall@arm.com> Cc: Zenghui Yu <yuzenghui@huawei.com> Cc: Zheng Xiang <zhengxiang9@huawei.com> Cc: Andrew Murray <andrew.murray@arm.com> Cc: Eric Auger <eric.auger@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | | | | | KVM: arm/arm64: Ensure vcpu target is unset on reset failureAndrew Jones2019-04-251-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A failed KVM_ARM_VCPU_INIT should not set the vcpu target, as the vcpu target is used by kvm_vcpu_initialized() to determine if other vcpu ioctls may proceed. We need to set the target before calling kvm_reset_vcpu(), but if that call fails, we should then unset it and clear the feature bitmap while we're at it. Signed-off-by: Andrew Jones <drjones@redhat.com> [maz: Simplified patch, completed commit message] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | | | | | KVM: arm/arm64: vgic-v3: Retire pending interrupts on disabling LPIsMarc Zyngier2019-04-033-0/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When disabling LPIs (for example on reset) at the redistributor level, it is expected that LPIs that was pending in the CPU interface are eventually retired. Currently, this is not what is happening, and these LPIs will stay in the ap_list, eventually being acknowledged by the vcpu (which didn't quite expect this behaviour). The fix is thus to retire these LPIs from the list of pending interrupts as we disable LPIs. Reported-by: Heyi Guo <guoheyi@huawei.com> Tested-by: Heyi Guo <guoheyi@huawei.com> Fixes: 0e4e82f154e3 ("KVM: arm64: vgic-its: Enable ITS emulation as a virtual MSI controller") Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | | | | | KVM: arm/arm64: arch_timer: Fix CNTP_TVAL calculationWei Huang2019-03-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recently the generic timer test of kvm-unit-tests failed to complete (stalled) when a physical timer is being used. This issue is caused by incorrect update of CNTP_CVAL when CNTP_TVAL is being accessed, introduced by 'Commit 84135d3d18da ("KVM: arm/arm64: consolidate arch timer trap handlers")'. According to Arm ARM, the read/write behavior of accesses to the TVAL registers is expected to be: * READ: TimerValue = (CompareValue – (Counter - Offset) * WRITE: CompareValue = ((Counter - Offset) + Sign(TimerValue) This patch fixes the TVAL read/write code path according to the specification. Fixes: 84135d3d18da ("KVM: arm/arm64: consolidate arch timer trap handlers") Signed-off-by: Wei Huang <wei@redhat.com> [maz: commit message tidy-up] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| * | | | | | | KVM: selftests: make hyperv_cpuid test pass on AMDVitaly Kuznetsov2019-04-301-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enlightened VMCS is only supported on Intel CPUs but the test shouldn't fail completely. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | | KVM: lapic: Check for in-kernel LAPIC before deferencing apic pointerSean Christopherson2019-04-302-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ...to avoid dereferencing a null pointer when querying the per-vCPU timer advance. Fixes: 39497d7660d98 ("KVM: lapic: Track lapic timer advance per vCPU") Reported-by: syzbot+f7e65445a40d3e0e4ebf@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | | KVM: fix KVM_CLEAR_DIRTY_LOG for memory slots of unaligned sizePaolo Bonzini2019-04-303-8/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a memory slot's size is not a multiple of 64 pages (256K), then the KVM_CLEAR_DIRTY_LOG API is unusable: clearing the final 64 pages either requires the requested page range to go beyond memslot->npages, or requires log->num_pages to be unaligned, and kvm_clear_dirty_log_protect requires log->num_pages to be both in range and aligned. To allow this case, allow log->num_pages not to be a multiple of 64 if it ends exactly on the last page of the slot. Reported-by: Peter Xu <peterx@redhat.com> Fixes: 98938aa8edd6 ("KVM: validate userspace input in kvm_clear_dirty_log_protect()", 2019-01-02) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | | x86/kvm/mmu: reset MMU context when 32-bit guest switches PAEVitaly Kuznetsov2019-04-302-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 47c42e6b4192 ("KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'") introduced a regression: 32-bit PAE guests stopped working. The issue appears to be: when guest switches (enables) PAE we need to re-initialize MMU context (set context->root_level, do reset_rsvds_bits_mask(), ...) but init_kvm_tdp_mmu() doesn't do that because we threw away is_pae(vcpu) flag from mmu role. Restore it to kvm_mmu_extended_role (as we now don't need it in base role) to fix the issue. Fixes: 47c42e6b4192 ("KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | | KVM: x86: Whitelist port 0x7e for pre-incrementing %ripSean Christopherson2019-04-302-2/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM's recent bug fix to update %rip after emulating I/O broke userspace that relied on the previous behavior of incrementing %rip prior to exiting to userspace. When running a Windows XP guest on AMD hardware, Qemu may patch "OUT 0x7E" instructions in reaction to the OUT itself. Because KVM's old behavior was to increment %rip before exiting to userspace to handle the I/O, Qemu manually adjusted %rip to account for the OUT instruction. Arguably this is a userspace bug as KVM requires userspace to re-enter the kernel to complete instruction emulation before taking any other actions. That being said, this is a bit of a grey area and breaking userspace that has worked for many years is bad. Pre-increment %rip on OUT to port 0x7e before exiting to userspace to hack around the issue. Fixes: 45def77ebf79e ("KVM: x86: update %rip after emulating IO") Reported-by: Simon Becherer <simon@becherer.de> Reported-and-tested-by: Iakov Karpov <srid@rkmail.ru> Reported-by: Gabriele Balducci <balducci@units.it> Reported-by: Antti Antinoja <reader@fennosys.fi> Cc: stable@vger.kernel.org Cc: Takashi Iwai <tiwai@suse.com> Cc: Jiri Slaby <jslaby@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | | Documentation: kvm: fix dirty log ioctl arch listsAndrew Jones2019-04-291-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM_GET_DIRTY_LOG is implemented by all architectures, not just x86, and KVM_CAP_MANUAL_DIRTY_LOG_PROTECT is additionally implemented by arm, arm64, and mips. Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | | KVM: VMX: Move RSB stuffing to before the first RET after VM-ExitRick Edgecombe2019-04-272-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The not-so-recent change to move VMX's VM-Exit handing to a dedicated "function" unintentionally exposed KVM to a speculative attack from the guest by executing a RET prior to stuffing the RSB. Make RSB stuffing happen immediately after VM-Exit, before any unpaired returns. Alternatively, the VM-Exit path could postpone full RSB stuffing until its current location by stuffing the RSB only as needed, or by avoiding returns in the VM-Exit path entirely, but both alternatives are beyond ugly since vmx_vmexit() has multiple indirect callers (by way of vmx_vmenter()). And putting the RSB stuffing immediately after VM-Exit makes it much less likely to be re-broken in the future. Note, the cost of PUSH/POP could be avoided in the normal flow by pairing the PUSH RAX with the POP RAX in __vmx_vcpu_run() and adding an a POP to nested_vmx_check_vmentry_hw(), but such a weird/subtle dependency is likely to cause problems in the long run, and PUSH/POP will take all of a few cycles, which is peanuts compared to the number of cycles required to fill the RSB. Fixes: 453eafbe65f7 ("KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines") Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | | KVM: lapic: Convert guest TSC to host time domain if necessarySean Christopherson2019-04-181-3/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To minimize the latency of timer interrupts as observed by the guest, KVM adjusts the values it programs into the host timers to account for the host's overhead of programming and handling the timer event. In the event that the adjustments are too aggressive, i.e. the timer fires earlier than the guest expects, KVM busy waits immediately prior to entering the guest. Currently, KVM manually converts the delay from nanoseconds to clock cycles. But, the conversion is done in the guest's time domain, while the delay occurs in the host's time domain. This is perfectly ok when the guest and host are using the same TSC ratio, but if the guest is using a different ratio then the delay may not be accurate and could wait too little or too long. When the guest is not using the host's ratio, convert the delay from guest clock cycles to host nanoseconds and use ndelay() instead of __delay() to provide more accurate timing. Because converting to nanoseconds is relatively expensive, e.g. requires division and more multiplication ops, continue using __delay() directly when guest and host TSCs are running at the same ratio. Cc: Liran Alon <liran.alon@oracle.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: stable@vger.kernel.org Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | | KVM: lapic: Allow user to disable adaptive tuning of timer advancementSean Christopherson2019-04-183-5/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The introduction of adaptive tuning of lapic timer advancement did not allow for the scenario where userspace would want to disable adaptive tuning but still employ timer advancement, e.g. for testing purposes or to handle a use case where adaptive tuning is unable to settle on a suitable time. This is epecially pertinent now that KVM places a hard threshold on the maximum advancment time. Rework the timer semantics to accept signed values, with a value of '-1' being interpreted as "use adaptive tuning with KVM's internal default", and any other value being used as an explicit advancement time, e.g. a time of '0' effectively disables advancement. Note, this does not completely restore the original behavior of lapic_timer_advance_ns. Prior to tracking the advancement per vCPU, which is necessary to support autotuning, userspace could adjust lapic_timer_advance_ns for *running* vCPU. With per-vCPU tracking, the module params are snapshotted at vCPU creation, i.e. applying a new advancement effectively requires restarting a VM. Dynamically updating a running vCPU is possible, e.g. a helper could be added to retrieve the desired delay, choosing between the global module param and the per-VCPU value depending on whether or not auto-tuning is (globally) enabled, but introduces a great deal of complexity. The wrapper itself is not complex, but understanding and documenting the effects of dynamically toggling auto-tuning and/or adjusting the timer advancement is nigh impossible since the behavior would be dependent on KVM's implementation as well as compiler optimizations. In other words, providing stable behavior would require extremely careful consideration now and in the future. Given that the expected use of a manually-tuned timer advancement is to "tune once, run many", use the vastly simpler approach of recognizing changes to the module params only when creating a new vCPU. Cc: Liran Alon <liran.alon@oracle.com> Cc: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | | KVM: lapic: Track lapic timer advance per vCPUSean Christopherson2019-04-185-25/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Automatically adjusting the globally-shared timer advancement could corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting the advancement value. That could be partially fixed by using a local variable for the arithmetic, but it would still be susceptible to a race when setting timer_advance_adjust_done. And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the correct calibration for a given vCPU may not apply to all vCPUs. Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is effectively violated when finding a stable advancement takes an extended amount of timer. Opportunistically change the definition of lapic_timer_advance_ns to a u32 so that it matches the style of struct kvm_timer. Explicitly pass the param to kvm_create_lapic() so that it doesn't have to be exposed to lapic.c, thus reducing the probability of unintentionally using the global value instead of the per-vCPU value. Cc: Liran Alon <liran.alon@oracle.com> Cc: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | | KVM: lapic: Disable timer advancement if adaptive tuning goes haywireSean Christopherson2019-04-181-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To minimize the latency of timer interrupts as observed by the guest, KVM adjusts the values it programs into the host timers to account for the host's overhead of programming and handling the timer event. Now that the timer advancement is automatically tuned during runtime, it's effectively unbounded by default, e.g. if KVM is running as L1 the advancement can measure in hundreds of milliseconds. Disable timer advancement if adaptive tuning yields an advancement of more than 5000ns, as large advancements can break reasonable assumptions of the guest, e.g. that a timer configured to fire after 1ms won't arrive on the next instruction. Although KVM busy waits to mitigate the case of a timer event arriving too early, complications can arise when shifting the interrupt too far, e.g. kvm-unit-test's vmx.interrupt test will fail when its "host" exits on interrupts as KVM may inject the INTR before the guest executes STI+HLT. Arguably the unit test is "broken" in the sense that delaying a timer interrupt by 1ms doesn't technically guarantee the interrupt will arrive after STI+HLT, but it's a reasonable assumption that KVM should support. Furthermore, an unbounded advancement also effectively unbounds the time spent busy waiting, e.g. if the guest programs a timer with a very large delay. 5000ns is a somewhat arbitrary threshold. When running on bare metal, which is the intended use case, timer advancement is expected to be in the general vicinity of 1000ns. 5000ns is high enough that false positives are unlikely, while not being so high as to negatively affect the host's performance/stability. Note, a future patch will enable userspace to disable KVM's adaptive tuning, which will allow priveleged userspace will to specifying an advancement value in excess of this arbitrary threshold in order to satisfy an abnormal use case. Cc: Liran Alon <liran.alon@oracle.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: stable@vger.kernel.org Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>