summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-08-25intel_th: Add global activate/deactivate callbacks for the glue layersAlexander Shishkin
A glue layer may want to install its own hooks into trace capture start and stop paths to apply workarounds. This adds optional callbacks. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
2017-08-25intel_th: pci: Use drvdata for quirksAlexander Shishkin
Allow attaching miscellaneous quirk information to devices as drvdata. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
2017-08-25intel_th: pci: Add Cannon Lake PCH-LP supportAlexander Shishkin
This adds Intel(R) Trace Hub PCI ID for Cannon Lake PCH-LP. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: <stable@vger.kernel.org>
2017-08-25intel_th: pci: Add Cannon Lake PCH-H supportAlexander Shishkin
This adds Intel(R) Trace Hub PCI ID for Cannon Lake PCH-H. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: <stable@vger.kernel.org>
2017-08-25intel_th: pti: Support Low Power Path output port typeAlexander Shishkin
The Low Power Path (LPP) output port type, looks mostly like PTI to the software, with a few additional bits in the control register. This extends the PTI driver to support LPP ports as well. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
2017-08-25intel_th: Enumerate Low Power Path output port typeAlexander Shishkin
Trace Hub 2.x adds Low Power Path (LPP) output port type, which provides a low power mode trace path from sources to PTI or BSSB. This adds an output subdevice for the LPP port. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
2017-08-25intel_th: msu: Use the real device in case of IOMMU domain allocationAlexander Shishkin
When allocating DMA buffers for the MSU, use the real device instead of GTH. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
2017-08-25intel_th: Make the switch allocate its subdevicesAlexander Shishkin
Instead of allocating devices for every possible output subdevice, allow the switch to allocate only the ones that it knows about. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
2017-08-25intel_th: Make SOURCE devices children of the root deviceAlexander Shishkin
The switch (GTH) does not directly interact with SOURCE type devices and may not even be present (in host mode). To reflect this and avoid inconsistencies between target and host mode, make SOURCE devices descendant directly from the root (i.e. PCI) device. Their symlinks will no longer appear under the switch device, but they can still be found under intel_th bus. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
2017-08-25intel_th: Streamline the subdevice tree accessorsAlexander Shishkin
Make to_intel_th*() accessors available from the main header file. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
2017-08-25intel_th: Output devices without ports don't need assigningAlexander Shishkin
Output subdevices that rely on other output subdevices (or otherwise don't directly talk to an output port on the switch) don't need to be assigned an output port either. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
2017-08-25intel_th: pci: Enable bus masteringAlexander Shishkin
The driver forgets to enable bus mastering for the PCI device. Fix this. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
2017-08-25stm class: Document the stm_ftraceChunyan Zhang
This patch adds a description to the stm_ftrace device source, an interface for collecting Ftrace's function trace information via STM devices. Signed-off-by: Chunyan Zhang <chunyan.zhang@spreadtrum.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
2017-08-25stm: Potential read overflow in stm_char_policy_set_ioctl()Dan Carpenter
The "size" variable comes from the user so we need to verify that it's large enough to hold an stp_policy_id struct. Fixes: 7bd1d4093c2f ("stm class: Introduce an abstraction for System Trace Module devices") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
2017-08-25blk-mq-debugfs: Add names for recently added flagsBart Van Assche
The symbolic constants QUEUE_FLAG_SCSI_PASSTHROUGH, QUEUE_FLAG_QUIESCED and REQ_NOWAIT are missing from blk-mq-debugfs.c. Add these to blk-mq-debugfs.c such that these appear as names in debugfs instead of as numbers. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-25sched/debug: Optimize sched_domain sysctl generationPeter Zijlstra
Currently we unconditionally destroy all sysctl bits and regenerate them after we've rebuild the domains (even if that rebuild is a no-op). And since we unconditionally (re)build the sysctl for all possible CPUs, onlining all CPUs gets us O(n^2) time. Instead change this to only rebuild the bits for CPUs we've actually installed new domains on. Reported-by: Ofer Levi(SW) <oferle@mellanox.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25sched/topology: Avoid pointless rebuildPeter Zijlstra
Fix partition_sched_domains() to try and preserve the existing machine wide domain instead of unconditionally destroying it. We do this by attempting to allocate the new single domain, only when that fails to we reuse the fallback_doms. When using fallback_doms we need to first destroy and then recreate because both the old and new could be backed by it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ofer Levi(SW) <oferle@mellanox.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com> Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25sched/topology, cpuset: Avoid spurious/wrong domain rebuildsPeter Zijlstra
When disabling cpuset.sched_load_balance we expect to be able to online CPUs without generating sched_domains. However this is currently completely broken. What happens is that we generate the sched_domains and then destroy them. This is because of the spurious 'default' domain build in cpuset_update_active_cpus(). That builds a single machine wide domain and then schedules a work to build the 'real' domains. The work then finds there are _no_ domains and destroys the lot again. Furthermore, if there actually were cpusets, building the machine wide domain is actively wrong, because it would allow tasks to 'escape' their cpuset. Also I don't think its needed, the scheduler really should respect the active mask. Reported-by: Ofer Levi(SW) <oferle@mellanox.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com> Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25sched/topology: Improve commentsPeter Zijlstra
Mike provided a better comment for destroy_sched_domain() ... Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25sched/topology: Fix memory leak in __sdt_alloc()Shu Wang
Found this issue by kmemleak: the 'sg' and 'sgc' pointers from __sdt_alloc() might be leaked as each domain holds many groups' ref, but in destroy_sched_domain(), it only declined the first group ref. Onlining and offlining a CPU can trigger this leak, and cause OOM. Reproducer for my 6 CPUs machine: while true do echo 0 > /sys/devices/system/cpu/cpu5/online; echo 1 > /sys/devices/system/cpu/cpu5/online; done unreferenced object 0xffff88007d772a80 (size 64): comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s) hex dump (first 32 bytes): c0 22 77 7d 00 88 ff ff 02 00 00 00 01 00 00 00 ."w}............ 40 2a 77 7d 00 88 ff ff 00 00 00 00 00 00 00 00 @*w}............ backtrace: [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280 [<ffffffff810d94a8>] build_sched_domains+0x1e8/0xf20 [<ffffffff810da674>] partition_sched_domains+0x304/0x360 [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40 [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0 [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400 [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0 [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0 [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160 [<ffffffff810af5d9>] kthread+0x109/0x140 [<ffffffff81770e45>] ret_from_fork+0x25/0x30 [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffff88007d772a40 (size 64): comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s) hex dump (first 32 bytes): 03 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 ................ 00 04 00 00 00 00 00 00 4f 3c fc ff 00 00 00 00 ........O<...... backtrace: [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280 [<ffffffff810da16d>] build_sched_domains+0xead/0xf20 [<ffffffff810da674>] partition_sched_domains+0x304/0x360 [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40 [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0 [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400 [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0 [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0 [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160 [<ffffffff810af5d9>] kthread+0x109/0x140 [<ffffffff81770e45>] ret_from_fork+0x25/0x30 [<ffffffffffffffff>] 0xffffffffffffffff Reported-by: Chunyu Hu <chuhu@redhat.com> Signed-off-by: Shu Wang <shuwang@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Chunyu Hu <chuhu@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: liwang@redhat.com Link: http://lkml.kernel.org/r/1502351536-9108-1-git-send-email-shuwang@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25KVM: PPC: Book3S: Fix race and leak in kvm_vm_ioctl_create_spapr_tce()Paul Mackerras
Nixiaoming pointed out that there is a memory leak in kvm_vm_ioctl_create_spapr_tce() if the call to anon_inode_getfd() fails; the memory allocated for the kvmppc_spapr_tce_table struct is not freed, and nor are the pages allocated for the iommu tables. In addition, we have already incremented the process's count of locked memory pages, and this doesn't get restored on error. David Hildenbrand pointed out that there is a race in that the function checks early on that there is not already an entry in the stt->iommu_tables list with the same LIOBN, but an entry with the same LIOBN could get added between then and when the new entry is added to the list. This fixes all three problems. To simplify things, we now call anon_inode_getfd() before placing the new entry in the list. The check for an existing entry is done while holding the kvm->lock mutex, immediately before adding the new entry to the list. Finally, on failure we now call kvmppc_account_memlimit to decrement the process's count of locked memory pages. Reported-by: Nixiaoming <nixiaoming@huawei.com> Reported-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-25drm: rename u32 in __u32 in uapiLionel Landwerlin
All other fields use __ Cc: Ben Widawsky <ben@bwidawsk.net> Fixes: db1689aa61b ("drm: Create a format/modifier blob") Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Signed-off-by: Daniel Stone <daniels@collabora.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Emil Velikov <emil.l.velikov@gmail.com> Reviewed-by: Ben Widawsky <ben@bwidawsk.net> Reviewed-by: Daniel Stone <daniels@collabora.com> Link: https://patchwork.freedesktop.org/patch/msgid/20170824150814.5878-1-lionel.g.landwerlin@intel.com
2017-08-25Merge branch 'linus' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25Documentation/locking/atomic: Finish the document...Peter Zijlstra
Julia reported that the document looked unfinished, and it is. I forgot to include the example cooked up by Paul here: https://lkml.kernel.org/r/20170731174345.GL3730@linux.vnet.ibm.com and I added an explicit example showing how, while it is an ACQUIRE pattern, it really does provide an MB. Reported-by: Julia Cartwright <julia@ni.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25locking/lockdep: Fix workqueue crossrelease annotationPeter Zijlstra
The new completion/crossrelease annotations interact unfavourable with the extant flush_work()/flush_workqueue() annotations. The problem is that when a single work class does: wait_for_completion(&C) and complete(&C) in different executions, we'll build dependencies like: lock_map_acquire(W) complete_acquire(C) and lock_map_acquire(W) complete_release(C) which results in the dependency chain: W->C->W, which lockdep thinks spells deadlock, even though there is no deadlock potential since works are ran concurrently. One possibility would be to change the work 'lock' to recursive-read, but that would mean hitting a lockdep limitation on recursive locks. Also, unconditinoally switching to recursive-read here would fail to detect the actual deadlock on single-threaded workqueues, which do have a problem with this. For now, forcefully disregard these locks for crossrelease. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boqun.feng@gmail.com Cc: byungchul.park@lge.com Cc: david@fromorbit.com Cc: johannes@sipsolutions.net Cc: oleg@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25workqueue/lockdep: 'Fix' flush_work() annotationPeter Zijlstra
The flush_work() annotation as introduced by commit: e159489baa71 ("workqueue: relax lockdep annotation on flush_work()") hits on the lockdep problem with recursive read locks. The situation as described is: Work W1: Work W2: Task: ARR(Q) ARR(Q) flush_workqueue(Q) A(W1) A(W2) A(Q) flush_work(W2) R(Q) A(W2) R(W2) if (special) A(Q) else ARR(Q) R(Q) where: A - acquire, ARR - acquire-read-recursive, R - release. Where under 'special' conditions we want to trigger a lock recursion deadlock, but otherwise allow the flush_work(). The allowing is done by using recursive read locks (ARR), but lockdep is broken for recursive stuff. However, there appears to be no need to acquire the lock if we're not 'special', so if we remove the 'else' clause things become much simpler and no longer need the recursion thing at all. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boqun.feng@gmail.com Cc: byungchul.park@lge.com Cc: david@fromorbit.com Cc: johannes@sipsolutions.net Cc: oleg@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25locking/lockdep/selftests: Add mixed read-write ABBA testsPeter Zijlstra
Currently lockdep has limited support for recursive readers, add a few mixed read-write ABBA selftests to show the extend of these limitations. [ 0.000000] ---------------------------------------------------------------------------- [ 0.000000] | spin |wlock |rlock |mutex | wsem | rsem | [ 0.000000] -------------------------------------------------------------------------- [ 0.000000] mixed read-lock/lock-write ABBA: |FAILED| | ok | [ 0.000000] mixed read-lock/lock-read ABBA: | ok | | ok | [ 0.000000] mixed write-lock/lock-write ABBA: | ok | | ok | This clearly illustrates the case where lockdep fails to find a deadlock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boqun.feng@gmail.com Cc: byungchul.park@lge.com Cc: david@fromorbit.com Cc: johannes@sipsolutions.net Cc: oleg@redhat.com Cc: tj@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25mm, locking/barriers: Clarify tlb_flush_pending() barriersPeter Zijlstra
Better document the ordering around tlb_flush_pending(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25perf/x86: Export some PMU attributes in caps/ directoryAndi Kleen
It can be difficult to figure out for user programs what features the x86 CPU PMU driver actually supports. Currently it requires grepping in dmesg, but dmesg is not always available. This adds a caps directory to /sys/bus/event_source/devices/cpu/, similar to the caps already used on intel_pt, which can be used to discover the available capabilities cleanly. Three capabilities are defined: - pmu_name: Underlying CPU name known to the driver - max_precise: Max precise level supported - branches: Known depth of LBR. Example: % grep . /sys/bus/event_source/devices/cpu/caps/* /sys/bus/event_source/devices/cpu/caps/branches:32 /sys/bus/event_source/devices/cpu/caps/max_precise:3 /sys/bus/event_source/devices/cpu/caps/pmu_name:skylake Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170822185201.9261-3-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25perf/x86: Only show format attributes when supportedAndi Kleen
Only show the Intel format attributes in sysfs when the feature is actually supported with the current model numbers. This allows programs to probe what format attributes are available, and give a sensible error message to users if they are not. This handles near all cases for intel attributes since Nehalem, except the (obscure) case when the model number if known, but PEBS is disabled in PERF_CAPABILITIES. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170822185201.9261-2-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25tracing, perf: Adjust code layout in get_recursion_context()Jesper Dangaard Brouer
In an XDP redirect applications using tracepoint xdp:xdp_redirect to diagnose TX overrun, I noticed perf_swevent_get_recursion_context() was consuming 2% CPU. This was reduced to 1.85% with this simple change. Looking at the annotated asm code, it was clear that the unlikely case in_nmi() test was chosen (by the compiler) as the most likely event/branch. This small adjustment makes the compiler (GCC version 7.1.1 20170622 (Red Hat 7.1.1-3)) put in_nmi() as an unlikely branch. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/150342256382.16595.986861478681783732.stgit@firesoul Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25perf/core: Don't report zero PIDs for exiting tasksOleg Nesterov
The exiting/dead task has no PIDs and in this case perf_event_pid/tid() return zero, change them to return -1 to distinguish this case from idle threads. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170822155928.GA6892@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25perf/x86: Fix data source decoding for SkylakeAndi Kleen
Skylake changed the encoding of the PEBS data source field. Some combinations are not available anymore, but some new cases e.g. for L4 cache hit are added. Fix up the conversion table for Skylake, similar as had been done for Nehalem. On Skylake server the encoding for L4 actually means persistent memory. Handle this case too. To properly describe it in the abstracted perf format I had to add some new fields. Since a hit can have only one level add a new field that is an enumeration, not a bit field to describe the level. It can describe any level. Some numbers are also used to describe PMEM and LFB. Also add a new generic remote flag that can be combined with the generic level to signify a remote cache. And there is an extension field for the snoop indication to handle the Forward state. I didn't add a generic flag for hops because it's not needed for Skylake. I changed the existing encodings for older CPUs to also fill in the new level and remote fields. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: jolsa@kernel.org Link: http://lkml.kernel.org/r/20170816222156.19953-3-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25perf/x86: Move Nehalem PEBS code to flagAndi Kleen
Minor cleanup: use an explicit x86_pmu flag to handle the missing Lock / TLB information on Nehalem, instead of always checking the model number for each PEBS sample. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: jolsa@kernel.org Link: http://lkml.kernel.org/r/20170816222156.19953-2-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25perf/aux: Ensure aux_wakeup represents most recent wakeup indexWill Deacon
The aux_watermark member of struct ring_buffer represents the period (in terms of bytes) at which wakeup events should be generated when data is written to the aux buffer in non-snapshot mode. On hardware that cannot generate an interrupt when the aux_head reaches an arbitrary wakeup index (such as ARM SPE), the aux_head sampled from handle->head in perf_aux_output_{skip,end} may in fact be past the wakeup index. This can lead to wakeup slowly falling behind the head. For example, consider the case where hardware can only generate an interrupt on a page-boundary and the aux buffer is initialised as follows: // Buffer size is 2 * PAGE_SIZE rb->aux_head = rb->aux_wakeup = 0 rb->aux_watermark = PAGE_SIZE / 2 following the first perf_aux_output_begin call, the handle is initialised with: handle->head = 0 handle->size = 2 * PAGE_SIZE handle->wakeup = PAGE_SIZE / 2 and the hardware will be programmed to generate an interrupt at PAGE_SIZE. When the interrupt is raised, the hardware head will be at PAGE_SIZE, so calling perf_aux_output_end(handle, PAGE_SIZE) puts the ring buffer into the following state: rb->aux_head = PAGE_SIZE rb->aux_wakeup = PAGE_SIZE / 2 rb->aux_watermark = PAGE_SIZE / 2 and then the next call to perf_aux_output_begin will result in: handle->head = handle->wakeup = PAGE_SIZE for which the semantics are unclear and, for a smaller aux_watermark (e.g. PAGE_SIZE / 4), then the wakeup would in fact be behind head at this point. This patch fixes the problem by rounding down the aux_head (as sampled from the handle) to the nearest aux_watermark boundary when updating rb->aux_wakeup, therefore taking into account any overruns by the hardware. Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/1502900297-21839-2-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25perf/aux: Make aux_{head,wakeup} ring_buffer members longWill Deacon
The aux_head and aux_wakeup members of struct ring_buffer are defined using the local_t type, despite the fact that they are only accessed via the perf_aux_output_*() functions, which cannot race with each other for a given ring buffer. This patch changes the type of the members to long, so we can avoid using the local_*() API where it isn't needed. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/1502900297-21839-1-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25perf/core: Fix group {cpu,task} validationMark Rutland
Regardless of which events form a group, it does not make sense for the events to target different tasks and/or CPUs, as this leaves the group inconsistent and impossible to schedule. The core perf code assumes that these are consistent across (successfully intialised) groups. Core perf code only verifies this when moving SW events into a HW context. Thus, we can violate this requirement for pure SW groups and pure HW groups, unless the relevant PMU driver happens to perform this verification itself. These mismatched groups subsequently wreak havoc elsewhere. For example, we handle watchpoints as SW events, and reserve watchpoint HW on a per-CPU basis at pmu::event_init() time to ensure that any event that is initialised is guaranteed to have a slot at pmu::add() time. However, the core code only checks the group leader's cpu filter (via event_filter_match()), and can thus install follower events onto CPUs violating thier (mismatched) CPU filters, potentially installing them into a CPU without sufficient reserved slots. This can be triggered with the below test case, resulting in warnings from arch backends. #define _GNU_SOURCE #include <linux/hw_breakpoint.h> #include <linux/perf_event.h> #include <sched.h> #include <stdio.h> #include <sys/prctl.h> #include <sys/syscall.h> #include <unistd.h> static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu, int group_fd, unsigned long flags) { return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags); } char watched_char; struct perf_event_attr wp_attr = { .type = PERF_TYPE_BREAKPOINT, .bp_type = HW_BREAKPOINT_RW, .bp_addr = (unsigned long)&watched_char, .bp_len = 1, .size = sizeof(wp_attr), }; int main(int argc, char *argv[]) { int leader, ret; cpu_set_t cpus; /* * Force use of CPU0 to ensure our CPU0-bound events get scheduled. */ CPU_ZERO(&cpus); CPU_SET(0, &cpus); ret = sched_setaffinity(0, sizeof(cpus), &cpus); if (ret) { printf("Unable to set cpu affinity\n"); return 1; } /* open leader event, bound to this task, CPU0 only */ leader = perf_event_open(&wp_attr, 0, 0, -1, 0); if (leader < 0) { printf("Couldn't open leader: %d\n", leader); return 1; } /* * Open a follower event that is bound to the same task, but a * different CPU. This means that the group should never be possible to * schedule. */ ret = perf_event_open(&wp_attr, 0, 1, leader, 0); if (ret < 0) { printf("Couldn't open mismatched follower: %d\n", ret); return 1; } else { printf("Opened leader/follower with mismastched CPUs\n"); } /* * Open as many independent events as we can, all bound to the same * task, CPU0 only. */ do { ret = perf_event_open(&wp_attr, 0, 0, -1, 0); } while (ret >= 0); /* * Force enable/disble all events to trigger the erronoeous * installation of the follower event. */ printf("Opened all events. Toggling..\n"); for (;;) { prctl(PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0); prctl(PR_TASK_PERF_EVENTS_ENABLE, 0, 0, 0, 0); } return 0; } Fix this by validating this requirement regardless of whether we're moving events. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zhou Chengming <zhouchengming1@huawei.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1498142498-15758-1-git-send-email-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25x86/mm: Fix use-after-free of ldt_structEric Biggers
The following commit: 39a0526fb3f7 ("x86/mm: Factor out LDT init from context init") renamed init_new_context() to init_new_context_ldt() and added a new init_new_context() which calls init_new_context_ldt(). However, the error code of init_new_context_ldt() was ignored. Consequently, if a memory allocation in alloc_ldt_struct() failed during a fork(), the ->context.ldt of the new task remained the same as that of the old task (due to the memcpy() in dup_mm()). ldt_struct's are not intended to be shared, so a use-after-free occurred after one task exited. Fix the bug by making init_new_context() pass through the error code of init_new_context_ldt(). This bug was found by syzkaller, which encountered the following splat: BUG: KASAN: use-after-free in free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116 Read of size 4 at addr ffff88006d2cb7c8 by task kworker/u9:0/3710 CPU: 1 PID: 3710 Comm: kworker/u9:0 Not tainted 4.13.0-rc4-next-20170811 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 print_address_description+0x73/0x250 mm/kasan/report.c:252 kasan_report_error mm/kasan/report.c:351 [inline] kasan_report+0x24e/0x340 mm/kasan/report.c:409 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429 free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116 free_ldt_struct arch/x86/kernel/ldt.c:173 [inline] destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171 destroy_context arch/x86/include/asm/mmu_context.h:157 [inline] __mmdrop+0xe9/0x530 kernel/fork.c:889 mmdrop include/linux/sched/mm.h:42 [inline] exec_mmap fs/exec.c:1061 [inline] flush_old_exec+0x173c/0x1ff0 fs/exec.c:1291 load_elf_binary+0x81f/0x4ba0 fs/binfmt_elf.c:855 search_binary_handler+0x142/0x6b0 fs/exec.c:1652 exec_binprm fs/exec.c:1694 [inline] do_execveat_common.isra.33+0x1746/0x22e0 fs/exec.c:1816 do_execve+0x31/0x40 fs/exec.c:1860 call_usermodehelper_exec_async+0x457/0x8f0 kernel/umh.c:100 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431 Allocated by task 3700: save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 save_stack+0x43/0xd0 mm/kasan/kasan.c:447 set_track mm/kasan/kasan.c:459 [inline] kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551 kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3627 kmalloc include/linux/slab.h:493 [inline] alloc_ldt_struct+0x52/0x140 arch/x86/kernel/ldt.c:67 write_ldt+0x7b7/0xab0 arch/x86/kernel/ldt.c:277 sys_modify_ldt+0x1ef/0x240 arch/x86/kernel/ldt.c:307 entry_SYSCALL_64_fastpath+0x1f/0xbe Freed by task 3700: save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 save_stack+0x43/0xd0 mm/kasan/kasan.c:447 set_track mm/kasan/kasan.c:459 [inline] kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524 __cache_free mm/slab.c:3503 [inline] kfree+0xca/0x250 mm/slab.c:3820 free_ldt_struct.part.2+0xdd/0x150 arch/x86/kernel/ldt.c:121 free_ldt_struct arch/x86/kernel/ldt.c:173 [inline] destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171 destroy_context arch/x86/include/asm/mmu_context.h:157 [inline] __mmdrop+0xe9/0x530 kernel/fork.c:889 mmdrop include/linux/sched/mm.h:42 [inline] __mmput kernel/fork.c:916 [inline] mmput+0x541/0x6e0 kernel/fork.c:927 copy_process.part.36+0x22e1/0x4af0 kernel/fork.c:1931 copy_process kernel/fork.c:1546 [inline] _do_fork+0x1ef/0xfb0 kernel/fork.c:2025 SYSC_clone kernel/fork.c:2135 [inline] SyS_clone+0x37/0x50 kernel/fork.c:2129 do_syscall_64+0x26c/0x8c0 arch/x86/entry/common.c:287 return_from_SYSCALL_64+0x0/0x7a Here is a C reproducer: #include <asm/ldt.h> #include <pthread.h> #include <signal.h> #include <stdlib.h> #include <sys/syscall.h> #include <sys/wait.h> #include <unistd.h> static void *fork_thread(void *_arg) { fork(); } int main(void) { struct user_desc desc = { .entry_number = 8191 }; syscall(__NR_modify_ldt, 1, &desc, sizeof(desc)); for (;;) { if (fork() == 0) { pthread_t t; srand(getpid()); pthread_create(&t, NULL, fork_thread, NULL); usleep(rand() % 10000); syscall(__NR_exit_group, 0); } wait(NULL); } } Note: the reproducer takes advantage of the fact that alloc_ldt_struct() may use vmalloc() to allocate a large ->entries array, and after commit: 5d17a73a2ebe ("vmalloc: back off when the current task is killed") it is possible for userspace to fail a task's vmalloc() by sending a fatal signal, e.g. via exit_group(). It would be more difficult to reproduce this bug on kernels without that commit. This bug only affected kernels with CONFIG_MODIFY_LDT_SYSCALL=y. Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> [v4.6+] Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Fixes: 39a0526fb3f7 ("x86/mm: Factor out LDT init from context init") Link: http://lkml.kernel.org/r/20170824175029.76040-1-ebiggers3@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25KVM, pkeys: do not use PKRU value in vcpu->arch.guest_fpu.statePaolo Bonzini
The host pkru is restored right after vcpu exit (commit 1be0e61), so KVM_GET_XSAVE will return the host PKRU value instead. Fix this by using the guest PKRU explicitly in fill_xsave and load_xsave. This part is based on a patch by Junkang Fu. The host PKRU data may also not match the value in vcpu->arch.guest_fpu.state, because it could have been changed by userspace since the last time it was saved, so skip loading it in kvm_load_guest_fpu. Reported-by: Junkang Fu <junkang.fjk@alibaba-inc.com> Cc: Yang Zhang <zy107165@alibaba-inc.com> Fixes: 1be0e61c1f255faaeab04a390e00c8b9b9042870 Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-25KVM: x86: simplify handling of PKRUPaolo Bonzini
Move it to struct kvm_arch_vcpu, replacing guest_pkru_valid with a simple comparison against the host value of the register. The write of PKRU in addition can be skipped if the guest has not enabled the feature. Once we do this, we need not test OSPKE in the host anymore, because guest_CR4.PKE=1 implies host_CR4.PKE=1. The static PKU test is kept to elide the code on older CPUs. Suggested-by: Yang Zhang <zy107165@alibaba-inc.com> Fixes: 1be0e61c1f255faaeab04a390e00c8b9b9042870 Cc: stable@vger.kernel.org Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-25KVM: x86: block guest protection keys unless the host has them enabledPaolo Bonzini
If the host has protection keys disabled, we cannot read and write the guest PKRU---RDPKRU and WRPKRU fail with #GP(0) if CR4.PKE=0. Block the PKU cpuid bit in that case. This ensures that guest_CR4.PKE=1 implies host_CR4.PKE=1. Fixes: 1be0e61c1f255faaeab04a390e00c8b9b9042870 Cc: stable@vger.kernel.org Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-25esp: Fix skb tailroom calculationSteffen Klassert
We use skb_availroom to calculate the skb tailroom for the ESP trailer. skb_availroom calculates the tailroom and subtracts this value by reserved_tailroom. However reserved_tailroom is a union with the skb mark. This means that we subtract the tailroom by the skb mark if set. Fix this by using skb_tailroom instead. Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-08-25esp: Fix locking on page fragment allocationSteffen Klassert
We allocate the page fragment for the ESP trailer inside a spinlock, but consume it outside of the lock. This is racy as some other cou could get the same page fragment then. Fix this by consuming the page fragment inside the lock too. Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-08-25Merge branch 'x86/asm' into x86/apicThomas Gleixner
Pick up dependent changes to avoid merge conflicts
2017-08-25drm/exynos: simplify set_pixfmt() in DECON and FIMD driversTobias Jakobi
DRM core already checks the validity of the pixelformat. Signed-off-by: Tobias Jakobi <tjakobi@math.uni-bielefeld.de> Signed-off-by: Inki Dae <inki.dae@samsung.com>
2017-08-25drm/exynos: consistent use of cppTobias Jakobi
A recent commit (272725c7db4da1fd3229d944fc76d2e98e3a144e) has removed the use of 'bits_per_pixel' in DRM. However the corresponding Exynos driver code still uses the ambiguous 'bpp', even though it is now initialized from fb->cpp[0]. Consistenly use 'cpp' in FIMD, DECON7 and DECON5433 drivers. Signed-off-by: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
2017-08-24netvsc: fix deadlock betwen link status and removalstephen hemminger
There is a deadlock possible when canceling the link status delayed work queue. The removal process is run with RTNL held, and the link status callback is acquring RTNL. Resolve the issue by using trylock and rescheduling. If cancel is in process, that block it from happening. Fixes: 122a5f6410f4 ("staging: hv: use delayed_work for netvsc_send_garp()") Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24Merge branch 'tipc-buffer-reassignment-fixes'David S. Miller
Parthasarathy Bhuvaragan says: ==================== tipc: buffer reassignment fixes This series contains fixes for buffer reassignments and a context imbalance. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>