summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2016-01-06perf/x86: Fix filter_events() bug with event mappingsStephane Eranian
This patch fixes a bug in the filter_events() function. The patch fixes the bug whereby if some mappings did not exist, e.g., STALLED_CYCLES_FRONTEND, then any event after it in the attrs array would disappear from the published list of events in /sys/devices/cpu/events. This could be verified easily on any system post SNB (which do not publish STALLED_CYCLES_FRONTEND): $ ./perf stat -e cycles,ref-cycles true Performance counter stats for 'true': 1,217,348 cycles <not supported> ref-cycles The problem is that in filter_events() there is an assumption that the argument (attrs) is organized in increasing continuous event indexes related to the event_map(). But if we remove the non-supported events by shifing the position in the array, then the lookup x86_pmu.event_map() needs to compensate for it, otherwise we are looking up the wrong index. This patch corrects this problem by compensating for the deleted events and with that ref-cycles reappears (here shown on Haswell): $ perf stat -e ref-cycles,cycles true Performance counter stats for 'true': 4,525,910 ref-cycles 1,064,920 cycles 0.002943888 seconds time elapsed Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: jolsa@kernel.org Cc: kan.liang@intel.com Fixes: 8300daa26755 ("perf/x86: Filter out undefined events from sysfs events attribute") Link: http://lkml.kernel.org/r/1449516805-6637-1-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86: Use INST_RETIRED.PREC_DIST for cycles: pppAndi Kleen
Add a new 'three-p' precise level, that uses INST_RETIRED.PREC_DIST as base. The basic mechanism of abusing the inverse cmask to get all cycles works the same as before. PREC_DIST is available on Sandy Bridge or later. It had some problems on Sandy Bridge, so we only use it on IvyBridge and later. I tested it on Broadwell and Skylake. PREC_DIST has special support for avoiding shadow effects, which can give better results compare to UOPS_RETIRED. The drawback is that PREC_DIST can only schedule on counter 1, but that is ok for cycle sampling, as there is normally no need to do multiple cycle sampling runs in parallel. It is still possible to run perf top in parallel, as that doesn't use precise mode. Also of course the multiplexing can still allow parallel operation. :pp stays with the previous event. Example: Sample a loop with 10 sqrt with old cycles:pp 0.14 │10: sqrtps %xmm1,%xmm0 <-------------- 9.13 │ sqrtps %xmm1,%xmm0 11.58 │ sqrtps %xmm1,%xmm0 11.51 │ sqrtps %xmm1,%xmm0 6.27 │ sqrtps %xmm1,%xmm0 10.38 │ sqrtps %xmm1,%xmm0 12.20 │ sqrtps %xmm1,%xmm0 12.74 │ sqrtps %xmm1,%xmm0 5.40 │ sqrtps %xmm1,%xmm0 10.14 │ sqrtps %xmm1,%xmm0 10.51 │ ↑ jmp 10 We expect all 10 sqrt to get roughly the sample number of samples. But you can see that the instruction directly after the JMP is systematically underestimated in the result, due to sampling shadow effects. With the new PREC_DIST based sampling this problem is gone and all instructions show up roughly evenly: 9.51 │10: sqrtps %xmm1,%xmm0 11.74 │ sqrtps %xmm1,%xmm0 11.84 │ sqrtps %xmm1,%xmm0 6.05 │ sqrtps %xmm1,%xmm0 10.46 │ sqrtps %xmm1,%xmm0 12.25 │ sqrtps %xmm1,%xmm0 12.18 │ sqrtps %xmm1,%xmm0 5.26 │ sqrtps %xmm1,%xmm0 10.13 │ sqrtps %xmm1,%xmm0 10.43 │ sqrtps %xmm1,%xmm0 0.16 │ ↑ jmp 10 Even with PREC_DIST there is still sampling skid and the result is not completely even, but systematic shadow effects are significantly reduced. The improvements are mainly expected to make a difference in high IPC code. With low IPC it should be similar. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: hpa@zytor.com Link: http://lkml.kernel.org/r/1448929689-13771-2-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for SkylakeAndi Kleen
I added UOPS_RETIRED.ALL by mistake to the Skylake PEBS event list for cycles:pp. But the event is not documented for Skylake, and has some issues. The recommended replacement for cycles:pp is to use INST_RETIRED.ANY+pebs as a base, similar to what CPUs before Sandy Bridge did. This new event is called INST_RETIRED.TOTAL_CYCLES_PS. The event is not really new, but has been already used by perf before Sandy Bridge for the original cycles:p Note the SDM doesn't document that event either, but it's being documented in the latest version of the event list on: https://download.01.org/perfmon/SKL This patch does: - Remove UOPS_RETIRED.ALL from the Skylake PEBS event list - Add INST_RETIRED.ANY to the Skylake PEBS event list, and an table entry to allow cmask=16,inv=1 for cycles:pp - We don't need an extra entry for the base INST_RETIRED event, because it is already covered by the catch-all PEBS table entry. - Switch Skylake to use the Core2 PEBS alias (which is INST_RETIRED.TOTAL_CYCLES_PS) Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: hpa@zytor.com Link: http://lkml.kernel.org/r/1448929689-13771-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86: Allow zero PEBS status with only single active eventAndi Kleen
Normally we drop PEBS events with a zero status field. But when there is only a single PEBS event active we can assume the PEBS record is for that event. The PEBS buffer is always flushed when PEBS events are disabled, so there is no risk of mishandling state PEBS records this way. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1449177740-5422-2-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86: Remove warning for zero PEBS statusAndi Kleen
The recent commit: 75f80859b130 ("perf/x86/intel/pebs: Robustify PEBS buffer drain") causes lots of warnings on different CPUs before Skylake when running PEBS intensive workloads. They can have a zero status field in the PEBS record when PEBS is racing with clearing of GLOBAl_STATUS. This also can cause hangs (it seems there are still problems with printk in NMI). Disable the warning, but still ignore the record. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1449177740-5422-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/core: Collapse more IPI loopsPeter Zijlstra
This patch collapses the two 'hard' cases, which are perf_event_{dis,en}able(). I cannot seem to convince myself the current code is correct. So starting with perf_event_disable(); we don't strictly need to test for event->state == ACTIVE, ctx->is_active is enough. If the event is not scheduled while the ctx is, __perf_event_disable() still does the right thing. Its a little less efficient to IPI in that case, over-all simpler. For perf_event_enable(); the same goes, but I think that's actually broken in its current form. The current condition is: ctx->is_active && event->state == OFF, that means it doesn't do anything when !ctx->active && event->state == OFF. This is wrong, it should still mark the event INACTIVE in that case, otherwise we'll still not try and schedule the event once the context becomes active again. This patch implements the two function using the new event_function_call() and does away with the tricky event->state tests. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexander Shishkin <alexander.shishkin@intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06Merge branch 'perf/urgent' into perf/core, to pick up fixes before applying ↵Ingo Molnar
new changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06sched/fair: Fix new task's load avg removed from source CPU in ↵Yuyang Du
wake_up_new_task() If a newly created task is selected to go to a different CPU in fork balance when it wakes up the first time, its load averages should not be removed from the source CPU since they are never added to it before. The same is also applicable to a never used group entity. Fix it in remove_entity_load_avg(): when entity's last_update_time is 0, simply return. This should precisely identify the case in question, because in other migrations, the last_update_time is set to 0 after remove_entity_load_avg(). Reported-by: Steve Muckle <steve.muckle@linaro.org> Signed-off-by: Yuyang Du <yuyang.du@intel.com> [peterz: cfs_rq_last_update_time] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <Juri.Lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20151216233427.GJ28098@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06sched/core: Move sched_entity::avg into separate cache lineJiri Olsa
The sched_entity::avg collides with read-mostly sched_entity data. The perf c2c tool showed many read HITM accesses across many CPUs for sched_entity's cfs_rq and my_q, while having at the same time tons of stores for avg. After placing sched_entity::avg into separate cache line, the perf bench sched pipe showed around 20 seconds speedup. NOTE I cut out all perf events except for cycles and instructions from following output. Before: $ perf stat -r 5 perf bench sched pipe -l 10000000 # Running 'sched/pipe' benchmark: # Executed 10000000 pipe operations between two processes Total time: 270.348 [sec] 27.034805 usecs/op 36989 ops/sec ... 245,537,074,035 cycles # 1.433 GHz 187,264,548,519 instructions # 0.77 insns per cycle 272.653840535 seconds time elapsed ( +- 1.31% ) After: $ perf stat -r 5 perf bench sched pipe -l 10000000 # Running 'sched/pipe' benchmark: # Executed 10000000 pipe operations between two processes Total time: 251.076 [sec] 25.107678 usecs/op 39828 ops/sec ... 244,573,513,928 cycles # 1.572 GHz 187,409,641,157 instructions # 0.76 insns per cycle 251.679315188 seconds time elapsed ( +- 0.31% ) Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Don Zickus <dzickus@redhat.com> Cc: Joe Mario <jmario@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1449606239-28602-1-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06x86/fpu: Properly align size in CHECK_MEMBER_AT_END_OF() macroJiri Olsa
The CHECK_MEMBER_AT_END_OF(TYPE, MEMBER) checks whether MEMBER is last member of TYPE by evaluating: offsetof(TYPE::MEMBER) + sizeof(TYPE::MEMBER) == sizeof(TYPE) and ensuring TYPE::MEMBER is the last member of the TYPE. This condition breaks on structs that are padded to be aligned. This patch ensures the TYPE alignment is taken into account. This bug was revealed after adding cacheline alignment into struct sched_entity, which broke task_struct::thread check: CHECK_MEMBER_AT_END_OF(struct task_struct, thread); Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dave Hansen <dave@sr71.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1450707930-3445-1-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06sched/deadline: Fix the earliest_dl.next logicWanpeng Li
earliest_dl.next should cache deadline of the earliest ready task that is also enqueued in the pushable rbtree, as pull algorithm uses this information to find candidates for migration: if the earliest_dl.next deadline of source rq is earlier than the earliest_dl.curr deadline of destination rq, the task from the source rq can be pulled. However, current implementation only guarantees that earliest_dl.next is the deadline of the next ready task instead of the next pushable task; which will result in potentially holding both rqs' lock and find nothing to migrate because of affinity constraints. In addition, current logic doesn't update the next candidate for pushing in pick_next_task_dl(), even if the running task is never eligible. This patch fixes both problems by updating earliest_dl.next when pushable dl task is enqueued/dequeued, similar to what we already do for RT. Tested-by: Luca Abeni <luca.abeni@unitn.it> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1449135730-27202-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06Merge branch 'sched/urgent' into sched/core, to pick up fixes before merging ↵Ingo Molnar
new patches Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06sched/core: Reset task's lockless wake-queues on fork()Sebastian Andrzej Siewior
In the following commit: 7675104990ed ("sched: Implement lockless wake-queues") we gained lockless wake-queues. The -RT kernel managed to lockup itself with those. There could be multiple attempts for task X to enqueue it for a wakeup _even_ if task X is already running. The reason is that task X could be runnable but not yet on CPU. The the task performing the wakeup did not leave the CPU it could performe multiple wakeups. With the proper timming task X could be running and enqueued for a wakeup. If this happens while X is performing a fork() then its its child will have a !NULL `wake_q` member copied. This is not a problem as long as the child task does not participate in lockless wakeups :) Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 7675104990ed ("sched: Implement lockless wake-queues") Link: http://lkml.kernel.org/r/20151221171710.GA5499@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06sched/core: Fix unserialized r-m-w scribbling stuffPeter Zijlstra
Some of the sched bitfieds (notably sched_reset_on_fork) can be set on other than current, this can cause the r-m-w to race with other updates. Since all the sched bits are serialized by scheduler locks, pull them in a separate word. Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: hannes@cmpxchg.org Cc: mhocko@kernel.org Cc: vdavydov@parallels.com Link: http://lkml.kernel.org/r/20151125150207.GM11639@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06sched/core: Check tgid in is_global_init()Sergey Senozhatsky
Our global init task can have sub-threads, so ->pid check is not reliable enough for is_global_init(), we need to check tgid instead. This has been spotted by Oleg and a fix was proposed by Richard a long time ago (see the link below). Oleg wrote: : Because is_global_init() is only true for the main thread of /sbin/init. : : Just look at oom_unkillable_task(). It tries to not kill init. But, say, : select_bad_process() can happily find a sub-thread of is_global_init() : and still kill it. I recently hit the problem in question; re-sending the patch (to the best of my knowledge it has never been submitted) with updated function comment. Credit goes to Oleg and Richard. Suggested-by: Richard Guy Briggs <rgb@redhat.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Eric W . Biederman <ebiederm@xmission.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Serge E . Hallyn <serge.hallyn@ubuntu.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://www.redhat.com/archives/linux-audit/2013-December/msg00086.html Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06sched/fair: Fix multiplication overflow on 32-bit systemsAndrey Ryabinin
Make 'r' 64-bit type to avoid overflow in 'r * LOAD_AVG_MAX' on 32-bit systems: UBSAN: Undefined behaviour in kernel/sched/fair.c:2785:18 signed integer overflow: 87950 * 47742 cannot be represented in type 'int' The most likely effect of this bug are bad load average numbers resulting in weird scheduling. It's also likely that this can persist for a longer time - until the system goes idle for a long time so that all load avg numbers get reset. [ This is the CFS load average metric, not the procfs output, which is separate. ] Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking") Link: http://lkml.kernel.org/r/1450097243-30137-1-git-send-email-aryabinin@virtuozzo.com [ Improved the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf: Fix race in swevent hashPeter Zijlstra
There's a race on CPU unplug where we free the swevent hash array while it can still have events on. This will result in a use-after-free which is BAD. Simply do not free the hash array on unplug. This leaves the thing around and no use-after-free takes place. When the last swevent dies, we do a for_each_possible_cpu() iteration anyway to clean these up, at which time we'll free it, so no leakage will occur. Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf: Fix race in perf_event_exec()Peter Zijlstra
I managed to tickle this warning: [ 2338.884942] ------------[ cut here ]------------ [ 2338.890112] WARNING: CPU: 13 PID: 35162 at ../kernel/events/core.c:2702 task_ctx_sched_out+0x6b/0x80() [ 2338.900504] Modules linked in: [ 2338.903933] CPU: 13 PID: 35162 Comm: bash Not tainted 4.4.0-rc4-dirty #244 [ 2338.911610] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013 [ 2338.923071] ffffffff81f1468e ffff8807c6457cb8 ffffffff815c680c 0000000000000000 [ 2338.931382] ffff8807c6457cf0 ffffffff810c8a56 ffffe8ffff8c1bd0 ffff8808132ed400 [ 2338.939678] 0000000000000286 ffff880813170380 ffff8808132ed400 ffff8807c6457d00 [ 2338.947987] Call Trace: [ 2338.950726] [<ffffffff815c680c>] dump_stack+0x4e/0x82 [ 2338.956474] [<ffffffff810c8a56>] warn_slowpath_common+0x86/0xc0 [ 2338.963195] [<ffffffff810c8b4a>] warn_slowpath_null+0x1a/0x20 [ 2338.969720] [<ffffffff811a49cb>] task_ctx_sched_out+0x6b/0x80 [ 2338.976244] [<ffffffff811a62d2>] perf_event_exec+0xe2/0x180 [ 2338.982575] [<ffffffff8121fb6f>] setup_new_exec+0x6f/0x1b0 [ 2338.988810] [<ffffffff8126de83>] load_elf_binary+0x393/0x1660 [ 2338.995339] [<ffffffff811dc772>] ? get_user_pages+0x52/0x60 [ 2339.001669] [<ffffffff8121e297>] search_binary_handler+0x97/0x200 [ 2339.008581] [<ffffffff8121f8b3>] do_execveat_common.isra.33+0x543/0x6e0 [ 2339.016072] [<ffffffff8121fcea>] SyS_execve+0x3a/0x50 [ 2339.021819] [<ffffffff819fc165>] stub_execve+0x5/0x5 [ 2339.027469] [<ffffffff819fbeb2>] ? entry_SYSCALL_64_fastpath+0x12/0x71 [ 2339.034860] ---[ end trace ee1337c59a0ddeac ]--- Which is a WARN_ON_ONCE() indicating that cpuctx->task_ctx is not what we expected it to be. This is because context switches can swap the task_struct::perf_event_ctxp[] pointer around. Therefore you have to either disable preemption when looking at current, or hold ctx->lock. Fix perf_event_enable_on_exec(), it loads current->perf_event_ctxp[] before disabling interrupts, therefore a preemption in the right place can swap contexts around and we're using the wrong one. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Potapenko <glider@google.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: syzkaller <syzkaller@googlegroups.com> Link: http://lkml.kernel.org/r/20151210195740.GG6357@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06x86/vsdo: Fix build on PARAVIRT_CLOCK=y, KVM_GUEST=nAndy Lutomirski
arch/x86/built-in.o: In function `arch_setup_additional_pages': (.text+0x587): undefined reference to `pvclock_pvti_cpu0_va' KVM_GUEST selects PARAVIRT_CLOCK, so we can make pvclock_pvti_cpu0_va depend on KVM_GUEST. Signed-off-by: Andy Lutomirski <luto@kernel.org> Tested-by: Borislav Petkov <bp@alien8.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <keescook@chromium.org> Link: http://lkml.kernel.org/r/444d38a9bcba832685740ea1401b569861d09a72.1451446564.git.luto@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-01-06dmaengine: Revert "dmaengine: mic_x100: add missing spin_unlock"Ashutosh Dixit
This reverts commit e958e079e254 ("dmaengine: mic_x100: add missing spin_unlock"). The above patch is incorrect. There is nothing wrong with the original code. The spin_lock is acquired in the "prep" functions and released in "submit". Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2016-01-06net: sched: fix missing free per cpu on qstatsJohn Fastabend
When a qdisc is using per cpu stats (currently just the ingress qdisc) only the bstats are being freed. This also free's the qstats. Fixes: b0ab6f92752b9f9d8 ("net: sched: enable per cpu qstats") Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06ARM: net: bpf: fix zero right shiftRabin Vincent
The LSR instruction cannot be used to perform a zero right shift since a 0 as the immediate value (imm5) in the LSR instruction encoding means that a shift of 32 is perfomed. See DecodeIMMShift() in the ARM ARM. Make the JIT skip generation of the LSR if a zero-shift is requested. This was found using american fuzzy lop. Signed-off-by: Rabin Vincent <rabin@rab.in> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06soreuseport: change consume_skb to kfree_skb in error caseCraig Gallek
Fixes: 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06soreuseport: pass skb to secondary UDP socket lookupCraig Gallek
This socket-lookup path did not pass along the skb in question in my original BPF-based socket selection patch. The skb in the udpN_lib_lookup2 path can be used for BPF-based socket selection just like it is in the 'traditional' udpN_lib_lookup path. udpN_lib_lookup2 kicks in when there are greater than 10 sockets in the same hlist slot. Coincidentally, I chose 10 sockets per reuseport group in my functional test, so the lookup2 path was not excersised. This adds an additional set of tests with 20 sockets. Fixes: 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") Fixes: 3ca8e4029969 ("soreuseport: BPF selection functional test") Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-066pack: fix free memory scribblesOne Thousand Gnomes
commit acf673a3187edf72068ee2f92f4dc47d66baed47 fixed a user triggerable free memory scribble but in doing so replaced it with a different one that allows the user to control the data and scribble even more. sixpack_close is called by the tty layer in tty context. The tty context is protected by sp_get() and sp_put(). However network layer activity via sp_xmit() is not protected this way. We must therefore stop the queue otherwise the user gets to dump a buffer mostly of their choice into freed kernel pages. Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06mlxsw: pci: Adjust value of CPU egress traffic classIdo Schimmel
During initialization, when creating the send descriptor queues (SDQs), we specify the CPU egress traffic class of each SDQ. The maximum number of classes of this type is different in the two ASICs supported by this PCI driver. New firmware versions check this value is set correctly, which causes errors on the Spectrum ASIC, as its max exposed egress traffic class is lower than 7. Solve this by setting this field to 3, which is an acceptable value for both ASICs. Note that we currently do not expose the QoS capabilities of the ASICs, so setting this to an hardcoded value is OK for now. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06net: filter: make JITs zero A for SKF_AD_ALU_XOR_XRabin Vincent
The SKF_AD_ALU_XOR_X ancillary is not like the other ancillary data instructions since it XORs A with X while all the others replace A with some loaded value. All the BPF JITs fail to clear A if this is used as the first instruction in a filter. This was found using american fuzzy lop. Add a helper to determine if A needs to be cleared given the first instruction in a filter, and use this in the JITs. Except for ARM, the rest have only been compile-tested. Fixes: 3480593131e0 ("net: filter: get rid of BPF_S_* enum") Signed-off-by: Rabin Vincent <rabin@rab.in> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06Merge tag 'wireless-drivers-next-for-davem-2016-01-05' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next Kalle Valo says: ==================== brcfmac * fix IBSS which got broken over time * new USB id for bcm43242 dongle * arp offload configuration through inet notifier ath9k * add random number generator support (CONFIG_ATH9K_HWRNG) iwlwifi * Make scan parameters low latency aware * Fix in the NL80211_FEATURE_FULL_AP_CLIENT_STATE state case * Fix enable injection mode (Chaya Rachel) * Various cleanups (Dan / Julia / myself) * Allow to stay more time on popular channels (David Spinadel) * Bug fixes for D0i3 (Eliad / Luca) * Fixes for GO uAPSD (myself) * Start of TSO support (myself) * Rate control bug fixes (Eyal / Gregory) * Start the work on 9000 devices (Johannes / Sara / Oren) * Start the work on a new Tx queue allocation model (Liad) * Debug infrastructure enhancements (Golan) mwifiex * add a debugfs file for chip reset * advertise SMS4 cipher suite * increase ap and station interface limit to 3 * enable MSI support on newer pcie devices (8897 onwards) rtlwifi * fix lots of module parameter usage ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06net: hns: avoid uninitialized variable warning:Arnd Bergmann
gcc fails to see that the use of the 'last_offset' variable in hns_nic_reuse_page() is used correctly and issues a bogus warning: drivers/net/ethernet/hisilicon/hns/hns_enet.c: In function 'hns_nic_reuse_page': drivers/net/ethernet/hisilicon/hns/hns_enet.c:541:6: warning: 'last_offset' may be used uninitialized in this function [-Wmaybe-uninitialized] This simplifies the function to make it more obvious what is going on to both readers and compilers, which makes the warning go away. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05inet: kill unused skb_free opFlorian Westphal
The only user was removed in commit 029f7f3b8701cc7a ("netfilter: ipv6: nf_defrag: avoid/free clone operations"). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06PM / OPP: Use snprintf() instead of sprintf()Viresh Kumar
sprintf() can access memory outside of the range of the character array, and is risky in some situations. The driver specified prop_name string can be longer than NAME_MAX here (only an attacker will do that though) and so blindly copying it into the character array of size NAME_MAX isn't safe. Instead we must use snprintf() here. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-05mtd: spi-nor: fix stm_is_locked_sr() parametersBrian Norris
stm_is_locked_sr() takes the status register (SR) value as the last parameter, not the second. Reported-by: Bayi Cheng <bayi.cheng@mediatek.com> Signed-off-by: Brian Norris <computersforpeace@gmail.com> Cc: Bayi Cheng <bayi.cheng@mediatek.com>
2016-01-05mtd: spi-nor: fix Spansion regressions (aliased with Winbond)Brian Norris
Spansion and Winbond have occasionally used the same manufacturer ID, and they don't support the same features. Particularly, writing SR=0 seems to break read access for Spansion's s25fl064k. Unfortunately, we don't currently have a way to differentiate these Spansion and Winbond parts, so rather than regressing support for these Spansion flash, let's drop the new Winbond lock/unlock support for now. We can try to address Winbond support during the next release cycle. Original discussion: http://patchwork.ozlabs.org/patch/549173/ http://patchwork.ozlabs.org/patch/553683/ Fixes: 357ca38d4751 ("mtd: spi-nor: support lock/unlock/is_locked for Winbond") Fixes: c6fc2171b249 ("mtd: spi-nor: disable protection for Winbond flash at startup") Signed-off-by: Brian Norris <computersforpeace@gmail.com> Reported-by: Felix Fietkau <nbd@openwrt.org> Cc: Felix Fietkau <nbd@openwrt.org>
2016-01-05Merge remote-tracking branch 'asoc/fix/intel' into asoc-linusMark Brown
2016-01-05Merge remote-tracking branch 'asoc/fix/rt5645' into asoc-linusMark Brown
2016-01-05Merge remote-tracking branch 'asoc/fix/dapm' into asoc-linusMark Brown
2016-01-05Merge remote-tracking branch 'asoc/fix/arizona' into asoc-linusMark Brown
2016-01-05bridge: Only call /sbin/bridge-stp for the initial network namespaceHannes Frederic Sowa
[I stole this patch from Eric Biederman. He wrote:] > There is no defined mechanism to pass network namespace information > into /sbin/bridge-stp therefore don't even try to invoke it except > for bridge devices in the initial network namespace. > > It is possible for unprivileged users to cause /sbin/bridge-stp to be > invoked for any network device name which if /sbin/bridge-stp does not > guard against unreasonable arguments or being invoked twice on the > same network device could cause problems. [Hannes: changed patch using netns_eq] Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05include/uapi/linux/sockios.h: mark SIOCRTMSG unusedxypron.glpk@gmx.de
IOCTL SIOCRTMSG does nothing but return EINVAL. So comment it as unused. SIOCRTMSG is only used in: * net/ipv4/af_inet.c * include/uapi/linux/sockios.h inet_ioctl calls ip_rt_ioctl. ip_rt_ioctl only handles SIOCADDRT and SIOCDELRT and returns -EINVAL otherwise. Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05Merge tag 'trace-v4.4-rc4-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Two more fixes: 1. The recordmcount change had an output that used sprintf() (incorrectly) when it should have been a fprintf() to stderr. 2. The printk_formats file could crash if someone added a trace_printk() in the core kernel, and also added one in a module. This does not affect production kernels. Only kernels where developers add trace_printk() for debugging can crash" * tag 'trace-v4.4-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix setting of start_index in find_next() ftrace/scripts: Fix incorrect use of sprintf in recordmcount
2016-01-05Merge branch 'stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile Pull tile bugfix from Chris Metcalf: "This fixes a bug that Sudip's buildbot found for tilepro allmodconfig. I've tagged it for stable only back to 3.19, which was when most of the other affected architectures added their support for working around this issue" * 'stable' of git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile: tile: provide CONFIG_PAGE_SIZE_64KB etc for tilepro
2016-01-05Merge branch 'mlx5e-tstamp'David S. Miller
Saeed Mahameed says: ==================== Introduce mlx5 ethernet timestamping This patch series introduces the support for ConnectX-4 timestamping and the PTP kernel interface. Changes from V2: net/mlx5_core: Introduce access function to read internal_timer - Remove one line function - Change function name net/mlx5e: Add HW timestamping (TS) support: - Data path performance optimization (caching tstamp struct in rq,sq) - Change read/write_lock_irqsave to read/write_lock - Move ioctl functions to en_clock file - Changed overflow start algorithm according to comments from Richard - Move timestamp init/cleanup to open/close ndos. In details: 1st patch prevents the driver from modifying skb->data and SKB CB in device xmit function. 2nd patch adds the needed low level helpers for: - Fetching the hardware clock (hardware internal timer) - Parsing CQEs timestamps - Device frequency capability 3rd patch adds new en_clock.c file that handles all needed timestamping operations: - Internal clock structure initialization and other helper functions - Added the needed ioctl for setting/getting the current timestamping configuration. - used this configuration in RX/TX data path to fill the SKB with the timestamp. 4th patch Introduces PTP (PHC) support. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05net/mlx5e: Add PTP Hardware Clock (PHC) supportEran Ben Elisha
Add a PHC support to the mlx5_en driver. Use reader/writer spinlocks to protect the timecounter since every packet received needs to call timecounter_cycle2time() when timestamping is enabled. This can become a performance bottleneck with RSS and multiple receive queues if normal spinlocks are used. The driver has been tested with both Documentation/ptp/testptp and the linuxptp project (http://linuxptp.sourceforge.net/) on a Mellanox ConnectX-4 card. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Cc: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05net/mlx5e: Add HW timestamping (TS) supportEran Ben Elisha
Add support for enable/disable HW timestamping for incoming and/or outgoing packets. To enable/disable HW timestamping appropriate ioctl should be used. Currently HWTSTAMP_FILTER_ALL/NONE and HWTSAMP_TX_ON/OFF only are supported. Make all relevant changes in RX/TX flows to consider TS request and plant HW timestamps into relevant structures. Add internal clock for converting hardware timestamp to nanoseconds. In addition, add a service task to catch internal clock overflow, to make sure timestamping is accurate. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05net/mlx5_core: Introduce access function to read internal timerEran Ben Elisha
A preparation step which adds support for reading the hardware internal timer and the hardware timestamping from the CQE. In addition, advertize device_frequency_khz HCA capability. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05net/mlx5e: Do not modify the TX SKBAchiad Shochat
If the SKB is cloned, or has an elevated users count, someone else can be looking at it at the same time. Signed-off-by: Achiad Shochat <achiad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05Merge remote-tracking branches 'regmap/topic/mmio', 'regmap/topic/rbtree' ↵Mark Brown
and 'regmap/topic/seq' into regmap-next
2016-01-05Merge remote-tracking branches 'regmap/topic/64bit' and ↵Mark Brown
'regmap/topic/irq-type' into regmap-next
2016-01-05Merge remote-tracking branch 'regmap/topic/core' into regmap-nextMark Brown
2016-01-05Merge remote-tracking branch 'regmap/topic/cache' into regmap-nextMark Brown