summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-10-17tracing: Use trace_clock_local() for looping in preemptirq_delay_test.cSteven Rostedt (VMware)
The preemptirq_delay_test module is used for the ftrace selftest code that tests the latency tracers. The problem is that it uses ktime for the delay loop, and then checks the tracer to see if the delay loop is caught, but the tracer uses trace_clock_local() which uses various different other clocks to measure the latency. As ktime uses the clock cycles, and the code then converts that to nanoseconds, it causes rounding errors, and the preemptirq latency tests are failing due to being off by 1 (it expects to see a delay of 500000 us, but the delay is only 499999 us). This is happening due to a rounding error in the ktime (which is totally legit). The purpose of the test is to see if it can catch the delay, not to test the accuracy between trace_clock_local() and ktime_get(). Best to use apples to apples, and have the delay loop use the same clock as the latency tracer does. Cc: stable@vger.kernel.org Fixes: f96e8577da102 ("lib: Add module for testing preemptoff/irqsoff latency tracers") Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-10-17tracepoint: Fix tracepoint array element size mismatchMathieu Desnoyers
commit 46e0c9be206f ("kernel: tracepoints: add support for relative references") changes the layout of the __tracepoint_ptrs section on architectures supporting relative references. However, it does so without turning struct tracepoint * const into const int elsewhere in the tracepoint code, which has the following side-effect: Setting mod->num_tracepoints is done in by module.c: mod->tracepoints_ptrs = section_objs(info, "__tracepoints_ptrs", sizeof(*mod->tracepoints_ptrs), &mod->num_tracepoints); Basically, since sizeof(*mod->tracepoints_ptrs) is a pointer size (rather than sizeof(int)), num_tracepoints is erroneously set to half the size it should be on 64-bit arch. So a module with an odd number of tracepoints misses the last tracepoint due to effect of integer division. So in the module going notifier: for_each_tracepoint_range(mod->tracepoints_ptrs, mod->tracepoints_ptrs + mod->num_tracepoints, tp_module_going_check_quiescent, NULL); the expression (mod->tracepoints_ptrs + mod->num_tracepoints) actually evaluates to something within the bounds of the array, but miss the last tracepoint if the number of tracepoints is odd on 64-bit arch. Fix this by introducing a new typedef: tracepoint_ptr_t, which is either "const int" on architectures that have PREL32 relocations, or "struct tracepoint * const" on architectures that does not have this feature. Also provide a new tracepoint_ptr_defer() static inline to encapsulate deferencing this type rather than duplicate code and ugly idefs within the for_each_tracepoint_range() implementation. This issue appears in 4.19-rc kernels, and should ideally be fixed before the end of the rc cycle. Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Jessica Yu <jeyu@kernel.org> Link: http://lkml.kernel.org/r/20181013191050.22389-1-mathieu.desnoyers@efficios.com Link: http://lkml.kernel.org/r/20180704083651.24360-7-ard.biesheuvel@linaro.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: James Morris <james.morris@microsoft.com> Cc: James Morris <jmorris@namei.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Nicolas Pitre <nico@linaro.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Russell King <linux@armlinux.org.uk> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-10-17igc: Add skeletal frame for Intel(R) 2.5G Ethernet Controller supportSasha Neftin
This patch adds the beginning framework onto which I am going to add the igc driver which supports the Intel(R) I225-LM/I225-V 2.5G Ethernet Controller. Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-10-17spi: sh-msiof: document R8A779{7|8}0 bindingsSergei Shtylyov
Document the R-Car V3{M|H} (R8A779{7|8}0) SoCs in the Renesas MSIOF bindings. Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Reviewed-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: Mark Brown <broonie@kernel.org>
2018-10-17ASoC: rsnd: tidyup SSICR::SWSP for TDMKuninori Morimoto
R-Car datasheet is indicating that WS output settings of SSICR::SWSP is inverted on TDM mode from non TDM mode settings. But, it is meaning that TDM should use 0 here. Without this patch, sound input/output 1ch will be 2ch, 2ch will be 3ch ..., be jumbled on I2S + TDM settings. This patch fixup it. This patch is tested on R-Car H3 ulcb-kf board, SSI3/4 TDM sound. Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2018-10-17ASoC: rsnd: enable TDM settings for SSI parentKuninori Morimoto
Some SSIs are sharing each pins (= WS/CLK pin for playback/capture). Then, SSI parent needs control WS/CLK setting for SSI slave. In such case, SSI parent needs TDM settings if SSI slave is working as TDM mode. But it is not cared in current driver. It can't capture TDM sound without this patch if SSIs were pin sharing. This patch is tested on R-Car H3 ulcb-kf board, SSI3/4 with TDM sound. Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2018-10-17ASoC: pcm3168a: add hw constraint for capture channelKuninori Morimoto
LEFT_J / I2S only can use TDM. commit 594680ea4a394 ("ASoC: pcm3168a: add hw constraint for channel") commit 3809688980205 ("ASoC: pcm3168a: add HW constraint for non RIGHT_J") added channel constraint for it, but, it was only for playback. This patch adds constraint for capture. Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2018-10-17ASoC: sta32x: Add support for XTI clockDaniel Mack
The STA32x chips feature an XTI clock input that needs to be stable before the reset signal is released. Therefore, the chip driver needs to get a handle to the clock. Instead of relying on other parts of the system to enable the clock, let the codec driver grab a handle itself. In order to keep existing boards working, clock support is made optional. Signed-off-by: Daniel Mack <daniel@zonque.org> Signed-off-by: Mark Brown <broonie@kernel.org>
2018-10-17usb: gadget: storage: Fix Spectre v1 vulnerabilityGustavo A. R. Silva
num can be indirectly controlled by user-space, hence leading to a potential exploitation of the Spectre variant 1 vulnerability. This issue was detected with the help of Smatch: drivers/usb/gadget/function/f_mass_storage.c:3177 fsg_lun_make() warn: potential spectre issue 'fsg_opts->common->luns' [r] (local cap) Fix this by sanitizing num before using it to index fsg_opts->common->luns Notice that given that speculation windows are large, the policy is to kill the speculation on the first load and not worry if it can be completed with a dependent load/store [1]. [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2 Cc: stable@vger.kernel.org Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Felipe Balbi <felipe.balbi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-17perf tools: Stop fallbacking to kallsyms for vdso symbols lookupArnaldo Carvalho de Melo
David reports that: <quote> Perf has this hack where it uses the kernel symbol map as a backup when a symbol can't be found in the user's symbol table(s). This causes problems because the tests driving this code path use machine__kernel_ip(), and that is completely meaningless on Sparc. On sparc64 the kernel and user live in physically separate virtual address spaces, rather than a shared one. And the kernel lives at a virtual address that overlaps common userspace addresses. So this test passes almost all the time when a user symbol lookup fails. The consequence of this is that, if the unfound user virtual address in the sample doesn't match up to a kernel symbol either, we trigger things like this code in builtin-top.c: if (al.sym == NULL && al.map != NULL) { const char *msg = "Kernel samples will not be resolved.\n"; /* * As we do lazy loading of symtabs we only will know if the * specified vmlinux file is invalid when we actually have a * hit in kernel space and then try to load it. So if we get * here and there are _no_ symbols in the DSO backing the * kernel map, bail out. * * We may never get here, for instance, if we use -K/ * --hide-kernel-symbols, even if the user specifies an * invalid --vmlinux ;-) */ if (!machine->kptr_restrict_warned && !top->vmlinux_warned && __map__is_kernel(al.map) && map__has_symbols(al.map)) { if (symbol_conf.vmlinux_name) { char serr[256]; dso__strerror_load(al.map->dso, serr, sizeof(serr)); ui__warning("The %s file can't be used: %s\n%s", symbol_conf.vmlinux_name, serr, msg); } else { ui__warning("A vmlinux file was not found.\n%s", msg); } if (use_browser <= 0) sleep(5); top->vmlinux_warned = true; } } When I fire up a compilation on sparc, this triggers immediately. I'm trying to figure out what the "backup to kernel map" code is accomplishing. I see some language in the current code and in the changes that have happened in this area talking about vdso. Does that really happen? The vdso is mapped into userspace virtual addresses, not kernel ones. More history. This didn't cause problems on sparc some time ago, because the kernel IP check used to be "ip < 0" :-) Sparc kernel addresses are not negative. But now with machine__kernel_ip(), which works using the symbol table determined kernel address range, it does trigger. What it all boils down to is that on architectures like sparc, machine__kernel_ip() should always return false in this scenerio, and therefore this kind of logic: if (cpumode == PERF_RECORD_MISC_USER && machine && mg != &machine->kmaps && machine__kernel_ip(machine, al->addr)) { is basically invalid. PERF_RECORD_MISC_USER implies no kernel address can possibly match for the sample/event in question (no matter how hard you try!) :-) </> So, I thought something had changed and in the past we would somehow find that address in the kallsyms, but I couldn't find anything to back that up, the patch introducing this is over a decade old, lots of things changed, so I was just thinking I was missing something. I tried a gtod busy loop to generate vdso activity and added a 'perf probe' at that branch, on x86_64 to see if it ever gets hit: Made thread__find_map() noinline, as 'perf probe' in lines of inline functions seems to not be working, only at function start. (Masami?) # perf probe -x ~/bin/perf -L thread__find_map:57 <thread__find_map@/home/acme/git/perf/tools/perf/util/event.c:57> 57 if (cpumode == PERF_RECORD_MISC_USER && machine && 58 mg != &machine->kmaps && 59 machine__kernel_ip(machine, al->addr)) { 60 mg = &machine->kmaps; 61 load_map = true; 62 goto try_again; } } else { /* * Kernel maps might be changed when loading * symbols so loading * must be done prior to using kernel maps. */ 69 if (load_map) 70 map__load(al->map); 71 al->addr = al->map->map_ip(al->map, al->addr); # perf probe -x ~/bin/perf thread__find_map:60 Added new event: probe_perf:thread__find_map (on thread__find_map:60 in /home/acme/bin/perf) You can now use it in all perf tools, such as: perf record -e probe_perf:thread__find_map -aR sleep 1 # Then used this to see if, system wide, those probe points were being hit: # perf trace -e *perf:thread*/max-stack=8/ ^C[root@jouet ~]# No hits when running 'perf top' and: # cat gtod.c #include <sys/time.h> int main(void) { struct timeval tv; while (1) gettimeofday(&tv, 0); return 0; } [root@jouet c]# ./gtod ^C Pressed 'P' in 'perf top' and the [vdso] samples are there: 62.84% [vdso] [.] __vdso_gettimeofday 8.13% gtod [.] main 7.51% [vdso] [.] 0x0000000000000914 5.78% [vdso] [.] 0x0000000000000917 5.43% gtod [.] _init 2.71% [vdso] [.] 0x000000000000092d 0.35% [kernel] [k] native_io_delay 0.33% libc-2.26.so [.] __memmove_avx_unaligned_erms 0.20% [vdso] [.] 0x000000000000091d 0.17% [i2c_i801] [k] i801_access 0.06% firefox [.] free 0.06% libglib-2.0.so.0.5400.3 [.] g_source_iter_next 0.05% [vdso] [.] 0x0000000000000919 0.05% libpthread-2.26.so [.] __pthread_mutex_lock 0.05% libpixman-1.so.0.34.0 [.] 0x000000000006d3a7 0.04% [kernel] [k] entry_SYSCALL_64_trampoline 0.04% libxul.so [.] style::dom_apis::query_selector_slow 0.04% [kernel] [k] module_get_kallsym 0.04% firefox [.] malloc 0.04% [vdso] [.] 0x0000000000000910 I added a 'perf probe' to thread__find_map:69, and that surely got tons of hits, i.e. for every map found, just to make sure the 'perf probe' command was really working. In the process I noticed a bug, we're only have records for '[vdso]' for pre-existing commands, i.e. ones that are running when we start 'perf top', when we will generate the PERF_RECORD_MMAP by looking at /perf/PID/maps. I.e. like this, for preexisting processes with a vdso map, again, tracing for all the system, only pre-existing processes get a [vdso] map (when having one): [root@jouet ~]# perf probe -x ~/bin/perf __machine__addnew_vdso Added new event: probe_perf:__machine__addnew_vdso (on __machine__addnew_vdso in /home/acme/bin/perf) You can now use it in all perf tools, such as: perf record -e probe_perf:__machine__addnew_vdso -aR sleep 1 [root@jouet ~]# perf trace -e probe_perf:__machine__addnew_vdso/max-stack=8/ 0.000 probe_perf:__machine__addnew_vdso:(568eb3) __machine__addnew_vdso (/home/acme/bin/perf) map__new (/home/acme/bin/perf) machine__process_mmap2_event (/home/acme/bin/perf) machine__process_event (/home/acme/bin/perf) perf_event__process (/home/acme/bin/perf) perf_tool__process_synth_event (/home/acme/bin/perf) perf_event__synthesize_mmap_events (/home/acme/bin/perf) __event__synthesize_thread (/home/acme/bin/perf) The kernel is generating a PERF_RECORD_MMAP for vDSOs, but somehow 'perf top' is not getting those records while 'perf record' is: # perf record ~acme/c/gtod ^C[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.076 MB perf.data (1499 samples) ] # perf report -D | grep PERF_RECORD_MMAP2 71293612401913 0x11b48 [0x70]: PERF_RECORD_MMAP2 25484/25484: [0x400000(0x1000) @ 0 fd:02 1137 541179306]: r-xp /home/acme/c/gtod 71293612419012 0x11be0 [0x70]: PERF_RECORD_MMAP2 25484/25484: [0x7fa4a2783000(0x227000) @ 0 fd:00 3146370 854107250]: r-xp /usr/lib64/ld-2.26.so 71293612432110 0x11c50 [0x60]: PERF_RECORD_MMAP2 25484/25484: [0x7ffcdb53a000(0x2000) @ 0 00:00 0 0]: r-xp [vdso] 71293612509944 0x11cb0 [0x70]: PERF_RECORD_MMAP2 25484/25484: [0x7fa4a23cd000(0x3b6000) @ 0 fd:00 3149723 262067164]: r-xp /usr/lib64/libc-2.26.so # # perf script | grep vdso | head gtod 25484 71293.612768: 2485554 cycles:ppp: 7ffcdb53a914 [unknown] ([vdso]) gtod 25484 71293.613576: 2149343 cycles:ppp: 7ffcdb53a917 [unknown] ([vdso]) gtod 25484 71293.614274: 1814652 cycles:ppp: 7ffcdb53aca8 __vdso_gettimeofday+0x98 ([vdso]) gtod 25484 71293.614862: 1669070 cycles:ppp: 7ffcdb53acc5 __vdso_gettimeofday+0xb5 ([vdso]) gtod 25484 71293.615404: 1451589 cycles:ppp: 7ffcdb53acc5 __vdso_gettimeofday+0xb5 ([vdso]) gtod 25484 71293.615999: 1269941 cycles:ppp: 7ffcdb53ace6 __vdso_gettimeofday+0xd6 ([vdso]) gtod 25484 71293.616405: 1177946 cycles:ppp: 7ffcdb53a914 [unknown] ([vdso]) gtod 25484 71293.616775: 1121290 cycles:ppp: 7ffcdb53ac47 __vdso_gettimeofday+0x37 ([vdso]) gtod 25484 71293.617150: 1037721 cycles:ppp: 7ffcdb53ace6 __vdso_gettimeofday+0xd6 ([vdso]) gtod 25484 71293.617478: 994526 cycles:ppp: 7ffcdb53ace6 __vdso_gettimeofday+0xd6 ([vdso]) # The patch is the obvious one and with it we also continue to resolve vdso symbols for pre-existing processes in 'perf top' and for all processes in 'perf record' + 'perf report/script'. Suggested-by: David Miller <davem@davemloft.net> Acked-by: David Miller <davem@davemloft.net> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-cs7skq9pp0kjypiju6o7trse@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-10-17ASoC: nau8822: new codec driverDavid Lin
Add driver for NAU88C22. Signed-off-by: David Lin <CTLIN0@nuvoton.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2018-10-17ASoC: tegra_sgtl5000: fix device_node refcountingMarcel Ziswiler
Similar to the following: commit 4321723648b0 ("ASoC: tegra_alc5632: fix device_node refcounting") commit 7c5dfd549617 ("ASoC: tegra: fix device_node refcounting") Signed-off-by: Marcel Ziswiler <marcel.ziswiler@toradex.com> Acked-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2018-10-17tools/testing/nvdimm: Populate dirty shutdown dataDan Williams
Allow the unit tests to verify the retrieval of the dirty shutdown count via smart commands, and allow the driver-load-time retrieval of the smart health payload to be simulated by nfit_test. Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-10-17acpi, nfit: Collect shutdown statusDan Williams
Some NVDIMMs, in addition to providing an indication of whether the previous shutdown was clean, also provide a running count of lifetime dirty-shutdown events for the device. In anticipation of this functionality appearing on more devices arrange for the nfit driver to retrieve / cache this data at DIMM discovery time, and export it via sysfs. Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-10-17KVM: arm64: Fix caching of host MDCR_EL2 valueMark Rutland
At boot time, KVM stashes the host MDCR_EL2 value, but only does this when the kernel is not running in hyp mode (i.e. is non-VHE). In these cases, the stashed value of MDCR_EL2.HPMN happens to be zero, which can lead to CONSTRAINED UNPREDICTABLE behaviour. Since we use this value to derive the MDCR_EL2 value when switching to/from a guest, after a guest have been run, the performance counters do not behave as expected. This has been observed to result in accesses via PMXEVTYPER_EL0 and PMXEVCNTR_EL0 not affecting the relevant counters, resulting in events not being counted. In these cases, only the fixed-purpose cycle counter appears to work as expected. Fix this by always stashing the host MDCR_EL2 value, regardless of VHE. Cc: Christopher Dall <christoffer.dall@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: stable@vger.kernel.org Fixes: 1e947bad0b63b351 ("arm64: KVM: Skip HYP setup when already running in HYP") Tested-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-10-17nvmet: Optionally use PCI P2P memoryLogan Gunthorpe
Create a configfs attribute in each nvme-fabrics namespace to enable P2P memory use. The attribute may be enabled (with a boolean) or a specific P2P device may be given (with the device's PCI name). When enabled, the namespace will ensure the underlying block device supports P2P and is compatible with any specified P2P device. If no device was specified it will ensure there is compatible P2P memory somewhere in the system. Enabling a namespace with P2P memory will fail with EINVAL (and an appropriate dmesg error) if any of these conditions are not met. Once a controller is set up on a specific port, the P2P device to use for each namespace will be found and stored in a radix tree by namespace ID. When memory is allocated for a request, the tree is used to look up the P2P device to allocate memory against. If no device is in the tree (because no appropriate device was found), or if allocation of P2P memory fails, fall back to using regular memory. Signed-off-by: Stephen Bates <sbates@raithlin.com> Signed-off-by: Steve Wise <swise@opengridcomputing.com> [hch: partial rewrite of the initial code] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-10-17nvmet: Introduce helper functions to allocate and free request SGLsLogan Gunthorpe
Add helpers to allocate and free the SGL in a struct nvmet_req: int nvmet_req_alloc_sgl(struct nvmet_req *req) void nvmet_req_free_sgl(struct nvmet_req *req) This will be expanded in a future patch to implement peer-to-peer memory DMAs and should be common with all target drivers. The new helpers are used in nvmet-rdma. Seeing we use req.transfer_len as the length of the SGL it is set earlier and cleared on any error. It also seems to be unnecessary to accumulate the length as the map_sgl functions should only ever be called once per request. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me>
2018-10-17nvme-pci: Add support for P2P memory in requestsLogan Gunthorpe
For P2P requests, we must use the pci_p2pmem_map_sg() function instead of the dma_map_sg functions. With that, we can then indicate PCI_P2P support in the request queue. For this, we create an NVME_F_PCI_P2P flag which tells the core to set QUEUE_FLAG_PCI_P2P in the request queue. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com>
2018-10-17nvme-pci: Use PCI p2pmem subsystem to manage the CMBLogan Gunthorpe
Register the CMB buffer as p2pmem and use the appropriate allocation functions to create and destroy the IO submission queues. If the CMB supports WDS and RDS, publish it for use as P2P memory by other devices. Kernels without CONFIG_PCI_P2PDMA will also no longer support NVMe CMB. However, seeing the main use-cases for the CMB is P2P operations, this seems like a reasonable dependency. We drop the __iomem safety on the buffer seeing that, by convention, it's safe to directly access memory mapped by memremap()/devm_memremap_pages(). Architectures where this is not safe will not be supported by memremap() and therefore will not support PCI P2P and have no support for CMB. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-10-17IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]()Logan Gunthorpe
In order to use PCI P2P memory the pci_p2pmem_map_sg() function must be called to map the correct PCI bus address. To do this, check the first page in the scatter list to see if it is P2P memory or not. At the moment, scatter lists that contain P2P memory must be homogeneous so if the first page is P2P the entire SGL should be P2P. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2018-10-17block: Add PCI P2P flag for request queueLogan Gunthorpe
Add QUEUE_FLAG_PCI_P2P, meaning a driver's request queue supports targeting P2P memory. This will be used by P2P providers and orchestrators (in subsequent patches) to ensure block devices can support P2P memory before submitting P2P-backed pages to submit_bio(). Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jens Axboe <axboe@kernel.dk>
2018-10-17PCI/P2PDMA: Add P2P DMA driver writer's documentationLogan Gunthorpe
Add a restructured text file describing how to write drivers with support for P2P DMA transactions. The document describes how to use the APIs that were added in the previous few commits. Also adds an index for the PCI documentation tree even though this is the only PCI document that has been converted to restructured text at this time. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Jonathan Corbet <corbet@lwn.net>
2018-10-17docs-rst: Add a new directory for PCI documentationLogan Gunthorpe
Add a new directory in the driver API guide for PCI-specific documentation. This is in preparation for adding a new PCI P2P DMA driver writers guide which will go in this directory. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Vinod Koul <vinod.koul@intel.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Thierry Reding <treding@nvidia.com> Cc: Sanyog Kale <sanyog.r.kale@intel.com> Cc: Sagar Dharia <sdharia@codeaurora.org>
2018-10-17PCI/P2PDMA: Introduce configfs/sysfs enable attribute helpersLogan Gunthorpe
Users of the P2PDMA infrastructure will typically need a way for the user to tell the kernel to use P2P resources. Typically this will be a simple on/off boolean operation but sometimes it may be desirable for the user to specify the exact device to use for the P2P operation. Add new helpers for attributes which take a boolean or a PCI device. Any boolean as accepted by strtobool() turn P2P on or off (such as 'y', 'n', '1', '0', etc). Specifying a full PCI device name/BDF will select the specific device. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-10-17PCI/P2PDMA: Add PCI p2pmem DMA mappings to adjust the bus offsetLogan Gunthorpe
The DMA address used when mapping PCI P2P memory must be the PCI bus address. Thus, introduce pci_p2pmem_map_sg() to map the correct addresses when using P2P memory. Memory mapped in this way does not need to be unmapped and thus if we provided pci_p2pmem_unmap_sg() it would be empty. This breaks the expected balance between map/unmap but was left out as an empty function doesn't really provide any benefit. In the future, if this call becomes necessary it can be added without much difficulty. For this, we assume that an SGL passed to these functions contain all P2P memory or no P2P memory. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-10-17PCI/P2PDMA: Add sysfs group to display p2pmem statsLogan Gunthorpe
Add a sysfs group to display statistics about P2P memory that is registered in each PCI device. Attributes in the group display the total amount of P2P memory, the amount available and whether it is published or not. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-10-17KVM: VMX: enable nested virtualization by defaultPaolo Bonzini
With live migration support and finally a good solution for exception event injection, nested VMX should be ready for having a stable userspace ABI. The results of syzkaller fuzzing are not perfect but not horrible either (and might be partially due to running on GCE, so that effectively we're testing three-level nesting on a fork of upstream KVM!). Enabling it by default seems like a nice way to conclude the 4.20 pull request. :) Unfortunately, enabling nested SVM in 2009 (commit 4b6e4dca701) was a bit premature. However, until live migration support is in place we can reasonably expect that it does not offer much in terms of ABI guarantees. Therefore we are still in time to break things and conform as much as possible to the interface used for VMX. Suggested-by: Jim Mattson <jmattson@google.com> Suggested-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Celebrated-by: Liran Alon <liran.alon@oracle.com> Celebrated-by: Wanpeng Li <kernellwp@gmail.com> Celebrated-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17KVM/x86: Use 32bit xor to clear registers in svm.cUros Bizjak
x86_64 zero-extends 32bit xor operation to a full 64bit register. Also add a comment and remove unnecessary instruction suffix in vmx.c Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOADJim Mattson
This is a per-VM capability which can be enabled by userspace so that the faulting linear address will be included with the information about a pending #PF in L2, and the "new DR6 bits" will be included with the information about a pending #DB in L2. With this capability enabled, the L1 hypervisor can now intercept #PF before CR2 is modified. Under VMX, the L1 hypervisor can now intercept #DB before DR6 and DR7 are modified. When userspace has enabled KVM_CAP_EXCEPTION_PAYLOAD, it should generally provide an appropriate payload when injecting a #PF or #DB exception via KVM_SET_VCPU_EVENTS. However, to support restoring old checkpoints, this payload is not required. Note that bit 16 of the "new DR6 bits" is set to indicate that a debug exception (#DB) or a breakpoint exception (#BP) occurred inside an RTM region while advanced debugging of RTM transactional regions was enabled. This is the reverse of DR6.RTM, which is cleared in this scenario. This capability also enables exception.pending in struct kvm_vcpu_events, which allows userspace to distinguish between pending and injected exceptions. Reported-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17kvm: vmx: Defer setting of DR6 until #DB deliveryJim Mattson
When exception payloads are enabled by userspace (which is not yet possible) and a #DB is raised in L2, defer the setting of DR6 until later. Under VMX, this allows the L1 hypervisor to intercept the fault before DR6 is modified. Under SVM, DR6 is modified before L1 can intercept the fault (as has always been the case with DR7). Note that the payload associated with a #DB exception includes only the "new DR6 bits." When the payload is delievered, DR6.B0-B3 will be cleared and DR6.RTM will be set prior to merging in the new DR6 bits. Also note that bit 16 in the "new DR6 bits" is set to indicate that a debug exception (#DB) or a breakpoint exception (#BP) occurred inside an RTM region while advanced debugging of RTM transactional regions was enabled. Though the reverse of DR6.RTM, this makes the #DB payload field compatible with both the pending debug exceptions field under VMX and the exit qualification for #DB exceptions under VMX. Reported-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17kvm: x86: Defer setting of CR2 until #PF deliveryJim Mattson
When exception payloads are enabled by userspace (which is not yet possible) and a #PF is raised in L2, defer the setting of CR2 until the #PF is delivered. This allows the L1 hypervisor to intercept the fault before CR2 is modified. For backwards compatibility, when exception payloads are not enabled by userspace, kvm_multiple_exception modifies CR2 when the #PF exception is raised. Reported-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17kvm: x86: Add payload operands to kvm_multiple_exceptionJim Mattson
kvm_multiple_exception now takes two additional operands: has_payload and payload, so that updates to CR2 (and DR6 under VMX) can be delayed until the exception is delivered. This is necessary to properly emulate VMX or SVM hardware behavior for nested virtualization. The new behavior is triggered by vcpu->kvm->arch.exception_payload_enabled, which will (later) be set by a new per-VM capability, KVM_CAP_EXCEPTION_PAYLOAD. Reported-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17kvm: x86: Add exception payload fields to kvm_vcpu_eventsJim Mattson
The per-VM capability KVM_CAP_EXCEPTION_PAYLOAD (to be introduced in a later commit) adds the following fields to struct kvm_vcpu_events: exception_has_payload, exception_payload, and exception.pending. With this capability set, all of the details of vcpu->arch.exception, including the payload for a pending exception, are reported to userspace in response to KVM_GET_VCPU_EVENTS. With this capability clear, the original ABI is preserved, and the exception.injected field is set for either pending or injected exceptions. When userspace calls KVM_SET_VCPU_EVENTS with KVM_CAP_EXCEPTION_PAYLOAD clear, exception.injected is no longer translated to exception.pending. KVM_SET_VCPU_EVENTS can now only establish a pending exception when KVM_CAP_EXCEPTION_PAYLOAD is set. Reported-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17IB/mlx5: Add support for extended atomic operationsYonatan Cohen
Extended atomic operations cmp&swp and fetch&add is a Mellanox feature extending the standard atomic operation to use, varied operand sizes, as apposed to normal atomic operation that use an 8 byte operand only. Extended atomics allows masking the results and arguments. This patch configures QP to support extended atomic operation with the maximum size possible, as exposed by HCA capabilities. Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com> Reviewed-by: Guy Levi <guyle@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-17Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-linusJens Axboe
Pull single NVMe fix from Christoph. * 'nvme-4.19' of git://git.infradead.org/nvme: nvme: remove ns sibling before clearing path
2018-10-17RDMA/core: Fix comment for hw stats init for port == 0Parav Pandit
When add_port() is done for port == 0, it indicates that ports hardware counters initialization should be skipped. Reflect so in the comment. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-17RDMA/core: Refactor ib_register_device() functionParav Pandit
ib_register_device() does several allocation and initialization steps. Split it into smaller more readable functions for easy review and maintenance. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-17RDMA/core: Fix unwinding flow in case of error to register deviceParav Pandit
If port pkey list initialization fails, free the port_immutable memory during cleanup path. Currently it is missed out. If cache setup fails, free the pkey list during cleanup path. Fixes: d291f1a65 ("IB/core: Enforce PKey security on QPs") Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-17ib_srp: Remove WARN_ON in srp_terminate_io()Hannes Reinecke
The WARN_ON() is pointless as the rport is placed in SDEV_TRANSPORT_OFFLINE at that time, so no new commands can be submitted via srp_queuecommand() Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.com> Acked-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-17drivers/block: Remove DAC960 driverHannes Reinecke
The DAC960 driver has been obsoleted by the myrb/myrs drivers, so it can be dropped. Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-17IB/mlx5: Allow scatter to CQE without global signaled WRsYonatan Cohen
Requester scatter to CQE is restricted to QPs configured to signal all WRs. This patch adds ability to enable scatter to cqe (force enable) in the requester without sig_all, for users who do not want all WRs signaled but rather just the ones whose data found in the CQE. Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com> Reviewed-by: Guy Levi <guyle@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-17IB/mlx5: Verify that driver supports user flagsYonatan Cohen
Flags sent down from user might not be supported by running driver. This might lead to unwanted bugs. To solve this, added macro to test for unsupported flags. Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-17IB/mlx5: Support scatter to CQE for DC transport typeYonatan Cohen
Scatter to CQE is a HW offload that saves PCI writes by scattering the payload to the CQE. This patch extends already existing functionality to support DC transport type. Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com> Reviewed-by: Guy Levi <guyle@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-17Merge remote-tracking branch 'mlx5-next' into for-nextDoug Ledford
Pick up changes to mlx5_ifc.h needed for direct Scatter to CQE support series to come next. Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-10-17parisc: Add alternative coding infrastructureHelge Deller
This patch adds the necessary code to patch a running kernel at runtime to improve performance. The current implementation offers a few optimizations variants: - When running a SMP kernel on a single UP processor, unwanted assembler statements like locking functions are overwritten with NOPs. When multiple instructions shall be skipped, one branch instruction is used instead of multiple nop instructions. - In the UP case, some pdtlb and pitlb instructions are patched to become pdtlb,l and pitlb,l which only flushes the CPU-local tlb entries instead of broadcasting the flush to other CPUs in the system and thus may improve performance. - fic and fdc instructions are skipped if no I- or D-caches are installed. This should speed up qemu emulation and cacheless systems. - If no cache coherence is needed for IO operations, the relevant fdc and sync instructions in the sba and ccio drivers are replaced by nops. - On systems which share I- and D-TLBs and thus don't have a seperate instruction TLB, the pitlb instruction is replaced by a nop. Live-patching is done early in the boot process, just after having run the system inventory. No drivers are running and thus no external interrupts should arrive. So the hope is that no TLB exceptions will occur during the patching. If this turns out to be wrong we will probably need to do the patching in real-mode. Signed-off-by: Helge Deller <deller@gmx.de>
2018-10-17PCI: mediatek: Add loadable kernel module supportHonghui Zhang
Implement remove() callback function for the Mediatek PCIe controller driver to add loadable kernel module support. Save the PCIe's GIC IRQ at probe so that it can be retrieved to call dispose_irq() to tear down the IRQ upon module removal. Signed-off-by: Honghui Zhang <honghui.zhang@mediatek.com> [lorenzo.pieralisi@arm.com: updated commit log] Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Ryder Lee <ryder.lee@mediatek.com>
2018-10-17PCI: mediatek: Add system PM support for MT2712 and MT7622Honghui Zhang
In order to reduce the PCIe power consumption in system suspend, the PCI bus physical layer should be gated. On system resume, the PCIe link should be re-established and the related control register values should be restored. Define suspend_noirq & resume_noirq callback functions to implement PM system syspend hooks for the PCI host controller. Signed-off-by: Honghui Zhang <honghui.zhang@mediatek.com> [lorenzo.pieralisi@arm.com: updated commit log] Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Ryder Lee <ryder.lee@mediatek.com>
2018-10-17PCI: mediatek: Fixup MSI enablement logic by enabling MSI before clocksHonghui Zhang
Commit 43e6409db64d ("PCI: mediatek: Add MSI support for MT2712 and MT7622") added MSI support but enabled MSI in the wrong place, at a step in the probe sequence where clocks were not still enabled. Fix this issue by calling mtk_pcie_enable_msi() in mtk_pcie_startup_port_v2() since clocks are enabled when mtk_pcie_startup_port_v2() is called. To avoid forward declaration of mtk_pcie_enable_msi(), move the mtk_pcie_startup_port_v2() function definition in the file. Fixes: 43e6409db64d ("PCI: mediatek: Add MSI support for MT2712 and MT7622") Signed-off-by: Honghui Zhang <honghui.zhang@mediatek.com> [lorenzo.pieralisi@arm.com: squashed commit and adapted log] Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Ryder Lee <ryder.lee@mediatek.com>
2018-10-17PCI: mediatek: Convert to use pci_host_probe()Honghui Zhang
Part of mtk_pcie_register_host() is an open-coded version of pci_host_probe(). So instead of duplicating this code, use pci_host_probe() directly and remove mtk_pcie_register_host(). Signed-off-by: Honghui Zhang <honghui.zhang@mediatek.com> [lorenzo.pieralisi@arm.com: commit log changes] Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Ryder Lee <ryder.lee@mediatek.com>
2018-10-17PCI: mediatek: Remove the redundant dev->pm_domain checkHonghui Zhang
There is no need to check whether device have a PM domain attached before calling the PM runtime methods. Remove it. Signed-off-by: Honghui Zhang <honghui.zhang@mediatek.com> [lorenzo.pieralisi@arm.com: commit log changes] Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Ryder Lee <ryder.lee@mediatek.com>