path: root/kernel
Age    Commit message    Author
2016-08-18  Merge tag 'pm-4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm  (Linus Torvalds)

Pull power management fixes from Rafael Wysocki:
 "More hibernation-related material: one fix for a recent regression in
  the core, one small cleanup of the x86-64 resume code and a
  documentation update.

  Specifics:

   - Fix a hibernate core regression resulting from uncovering a latent
     bug in its implementation of memory bitmaps by a recent commit
     (James Morse).

   - Use __pa() to compute a physical address in the x86-64 code
     finalizing resume from hibernation (Rafael Wysocki).

   - Update power management documentation related to system sleep
     states to remove outdated information from it and to add a
     description of a recently introduced hibernation debug feature to
     it (Rafael Wysocki)"

* tag 'pm-4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / hibernate: Fix rtree_next_node() to avoid walking off list ends
  x86/power/64: Use __pa() for physical address computation
  PM / sleep: Update some system sleep documentation
2016-08-18  locking/rwsem: Scan the wait_list for readers only once  (Davidlohr Bueso)
When wanting to wake up readers, __rwsem_mark_wakeup() currently iterates the wait_list twice while looking to wake up the first N queued reader tasks. While this can be quite inefficient, it was there such that an awoken reader would be first and foremost acknowledged by the lock counter.

Keeping the same logic, we can further benefit from the use of wake_qs and entirely avoid the first wait_list iteration that sets the counter, as wake_up_process() isn't going to occur right away, and therefore we maintain the counter->list order of going about things.

Other than saving cycles with O(n) "scanning", this change also nicely cleans up a good chunk of __rwsem_mark_wakeup(), making it both visually cleaner and less tedious to read.

For example, the following improvements were seen on some will-it-scale microbenchmarks, on a 48-core Haswell:

                                     v4.7                  v4.7-rwsem-v1
  Hmean  signal1-processes-8    5792691.42 (  0.00%)  5771971.04 ( -0.36%)
  Hmean  signal1-processes-12   6081199.96 (  0.00%)  6072174.38 ( -0.15%)
  Hmean  signal1-processes-21   3071137.71 (  0.00%)  3041336.72 ( -0.97%)
  Hmean  signal1-processes-48   3712039.98 (  0.00%)  3708113.59 ( -0.11%)
  Hmean  signal1-processes-79   4464573.45 (  0.00%)  4682798.66 (  4.89%)
  Hmean  signal1-processes-110  4486842.01 (  0.00%)  4633781.71 (  3.27%)
  Hmean  signal1-processes-141  4611816.83 (  0.00%)  4692725.38 (  1.75%)
  Hmean  signal1-processes-172  4638157.05 (  0.00%)  4714387.86 (  1.64%)
  Hmean  signal1-processes-203  4465077.80 (  0.00%)  4690348.07 (  5.05%)
  Hmean  signal1-processes-224  4410433.74 (  0.00%)  4687534.43 (  6.28%)
  Stddev signal1-processes-8       6360.47 (  0.00%)     8455.31 ( 32.94%)
  Stddev signal1-processes-12      4004.98 (  0.00%)     9156.13 (128.62%)
  Stddev signal1-processes-21      3273.14 (  0.00%)     5016.80 ( 53.27%)
  Stddev signal1-processes-48     28420.25 (  0.00%)    26576.22 ( -6.49%)
  Stddev signal1-processes-79     22038.34 (  0.00%)    18992.70 (-13.82%)
  Stddev signal1-processes-110    23226.93 (  0.00%)    17245.79 (-25.75%)
  Stddev signal1-processes-141     6358.98 (  0.00%)     7636.14 ( 20.08%)
  Stddev signal1-processes-172     9523.70 (  0.00%)     4824.75 (-49.34%)
  Stddev signal1-processes-203    13915.33 (  0.00%)     9326.33 (-32.98%)
  Stddev signal1-processes-224    15573.94 (  0.00%)    10613.82 (-31.85%)

Other runs that saw improvements include context_switch and pipe; and as expected, this is particularly highlighted on larger thread counts as it becomes more expensive to walk the list twice.

No change in wakeup ordering or semantics.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman.Long@hp.com Cc: dave@stgolabs.net Cc: jason.low2@hpe.com Cc: wanpeng.li@hotmail.com Link: http://lkml.kernel.org/r/1470384285-32163-4-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
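For reference, the single-pass shape described above looks roughly like the following sketch. It is a simplification, not the verbatim patch: it glosses over the owned-wake case, and the real code clears waiter->task with release semantics rather than simply unlinking the waiter.

  /*
   * Simplified sketch of a single-pass reader wakeup: queue each reader
   * on a wake_q while counting them, fix up the lock counter once, and
   * let the caller issue the wakeups (wake_up_q()) after dropping the
   * wait_lock -- preserving the counter->list ordering.
   */
  static void __rwsem_mark_wake_readers(struct rw_semaphore *sem,
  					struct wake_q_head *wake_q)
  {
  	struct rwsem_waiter *waiter, *tmp;
  	long adjustment, woken = 0;

  	list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
  		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
  			break;

  		woken++;
  		list_del(&waiter->list);
  		/* wake_up_q() runs later, after the counter is adjusted */
  		wake_q_add(wake_q, waiter->task);
  	}

  	adjustment = woken * RWSEM_ACTIVE_READ_BIAS;
  	if (list_empty(&sem->wait_list))
  		adjustment -= RWSEM_WAITING_BIAS;
  	if (adjustment)
  		atomic_long_add(adjustment, &sem->count);
  }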
2016-08-18  locking/rwsem: Remove a few useless comments  (Davidlohr Bueso)
Our rwsem code (xadd, at least) is rather well documented, but there are a few really annoying comments in there that serve no purpose and we shouldn't bother with them. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman.Long@hp.com Cc: dave@stgolabs.net Cc: jason.low2@hpe.com Cc: wanpeng.li@hotmail.com Link: http://lkml.kernel.org/r/1470384285-32163-3-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  locking/rwsem: Return void in __rwsem_mark_wake()  (Davidlohr Bueso)
We currently return a rw_semaphore structure, which is the same lock that was passed in as the function's argument in the first place. While several functions choose this kind of return value so that callers can use it (for things like ERR_PTR), that is not the case for __rwsem_mark_wake(). In addition, this function is really about the lock waiters (which we know exist at this point), so it's somewhat odd to be returning the sem structure.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman.Long@hp.com Cc: dave@stgolabs.net Cc: jason.low2@hpe.com Cc: wanpeng.li@hotmail.com Link: http://lkml.kernel.org/r/1470384285-32163-2-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  locking, rcu, cgroup: Avoid synchronize_sched() in __cgroup_procs_write()  (Peter Zijlstra)
The current percpu-rwsem read side is entirely free of serializing insns at the cost of having a synchronize_sched() in the write path. The latency of the synchronize_sched() is too high for cgroups. The commit 1ed1328792ff talks about the write path being a fairly cold path, but this is not the case for Android, which moves tasks to the foreground cgroup and back around binder IPC calls from foreground processes to background processes, so it is significantly hotter than human-initiated operations.

Switch cgroup_threadgroup_rwsem into the slow mode for now to avoid the problem; hopefully it should not be that slow after another commit: 80127a39681b ("locking/percpu-rwsem: Optimize readers and reduce global impact").

We could just add rcu_sync_enter() into cgroup_init(), but we do not want another synchronize_sched() at boot time, so this patch adds the new helper which doesn't block but currently can only be called before the first use.

Reported-by: John Stultz <john.stultz@linaro.org> Reported-by: Dmitry Shmidt <dimitrysh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Colin Cross <ccross@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rom Lemarchand <romlem@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Link: http://lkml.kernel.org/r/20160811165413.GA22807@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
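The "new helper which doesn't block" amounts to a state override on the rcu_sync structure; a sketch of its shape (simplified, field names per kernel/rcu/sync.c):

  /*
   * Sketch: mark the rcu_sync as having passed a grace period without
   * waiting for one, forcing percpu-rwsem users onto the slow path.
   * Only legal before the rcu_sync is first used (e.g. from early init).
   */
  void rcu_sync_enter_start(struct rcu_sync *rsp)
  {
  	rsp->gp_count++;
  	rsp->gp_state = GP_PASSED;
  }

With this, cgroup_init() can put cgroup_threadgroup_rwsem into slow mode at boot with a single call along the lines of rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss), with no synchronize_sched() cost.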
2016-08-18  sched/cputime: Improve scalability by not accounting thread group tasks pending runtime  (Stanislaw Gruszka)

Commit:

  d670ec13178d0 ("posix-cpu-timers: Cure SMP wobbles")

started accounting thread group tasks' pending runtime in thread_group_cputime(). Another commit:

  6e998916dfe32 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")

updated scheduler runtime statistics (calling update_curr()) when reading a task's pending runtime. Those changes cause bad performance of the SYS_times() and SYS_clock_gettimes(CLOCK_PROCESS_CPUTIME_ID) syscalls, especially on larger systems with many CPUs.

While we would like to keep cpuclock monotonicity, i.e. have the problems fixed by the above commits stay fixed, we would also like to have good performance. When we notice that the change from commit d670ec13178d0 is no longer needed to solve the problem addressed by that commit, because of the change from the second commit 6e998916dfe32, we get room for optimization.

Since we update a task while reading its pending runtime in task_sched_runtime(), clock_gettime(CLOCK_PROCESS_CPUTIME_ID) will see updated values, and in the testcase from d670ec13178d0 the process cpuclock will not be smaller than the thread cpuclock.

I tested the patch on testcases from commits d670ec13178d0 and 6e998916dfe32 and some other cpuclock/cputimers testcases, and did not find cpuclock monotonicity problems or other malfunctions.

This patch has the drawback that we will not provide thread group cputime up to date to the last moment. For example, when arming a cputime timer, we will arm it with possibly slightly outdated values and that timer will trigger earlier compared to the behaviour without the patch. However, that was the behaviour before commit d670ec13178d0 (kernel v3.1), so it's unlikely to affect applications.

The patch improves related syscall performance, as measured by Giovanni's benchmarks described in commit 6075620b0590e ("sched/cputime: Mitigate performance regression in times()/clock_gettime()"). The benchmark results are:

SYS_clock_gettime():

  threads  4.7-rc7  3.18-rc3             4.7-rc7 + prefetch  4.7-rc7 + patch
                    (pre-6e998916dfe3)
  2        3.48     2.23 ( 35.68%)       3.06 ( 11.83%)      1.08 ( 68.81%)
  5        3.33     2.83 ( 14.84%)       3.25 (  2.40%)      0.71 ( 78.55%)
  8        3.37     2.84 ( 15.80%)       3.26 (  3.30%)      0.56 ( 83.49%)
  12       3.32     3.09 (  6.69%)       3.37 ( -1.60%)      0.42 ( 87.28%)
  21       4.01     3.14 ( 21.70%)       3.90 (  2.74%)      0.35 ( 91.35%)
  30       3.63     3.28 (  9.75%)       3.36 (  7.41%)      0.28 ( 92.23%)
  48       3.71     3.02 ( 18.69%)       3.11 ( 16.27%)      0.39 ( 89.39%)
  79       3.75     2.88 ( 23.23%)       3.16 ( 15.74%)      0.46 ( 87.76%)
  110      3.81     2.95 ( 22.62%)       3.25 ( 14.80%)      0.56 ( 85.41%)
  128      3.88     3.05 ( 21.28%)       3.31 ( 14.76%)      0.62 ( 84.10%)

SYS_times():

  threads  4.7-rc7  3.18-rc3             4.7-rc7 + prefetch  4.7-rc7 + patch
                    (pre-6e998916dfe3)
  2        3.65     2.27 ( 37.94%)       3.25 ( 11.03%)      1.62 ( 55.71%)
  5        3.45     2.78 ( 19.34%)       3.17 (  7.92%)      2.33 ( 32.28%)
  8        3.52     2.79 ( 20.66%)       3.22 (  8.69%)      2.06 ( 41.44%)
  12       3.29     3.02 (  8.33%)       3.36 ( -2.04%)      2.00 ( 39.18%)
  21       4.07     3.10 ( 23.86%)       3.92 (  3.78%)      2.07 ( 49.18%)
  30       3.87     3.33 ( 13.80%)       3.40 ( 12.17%)      1.89 ( 51.12%)
  48       3.79     2.96 ( 21.94%)       3.16 ( 16.61%)      1.69 ( 55.46%)
  79       3.88     2.88 ( 25.82%)       3.28 ( 15.42%)      1.60 ( 58.81%)
  110      3.90     2.98 ( 23.73%)       3.38 ( 13.35%)      1.73 ( 55.61%)
  128      4.00     3.10 ( 22.40%)       3.38 ( 15.45%)      1.66 ( 58.52%)

Reported-and-tested-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <mgalbraith@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/20160817093043.GA25206@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  sched/fair: Let asymmetric CPU configurations balance at wake-up  (Morten Rasmussen)
Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if SD_BALANCE_WAKE is set on the sched_domains. For asymmetric configurations SD_WAKE_AFFINE is only desirable if the waking task's compute demand (utilization) is suitable for the waking CPU and the previous CPU, and all CPUs within their respective SD_SHARE_PKG_RESOURCES domains (sd_llc). If not, let wakeup balancing take over (find_idlest_{group,cpu}()).

This patch makes affine wake-ups conditional on whether both the waker CPU and the previous CPU have sufficient capacity for the waking task, assuming that the CPU capacities within an SD_SHARE_PKG_RESOURCES domain (sd_llc) are homogeneous.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: freedom.tan@mediatek.com Cc: keita.kobayashi.ym@renesas.com Cc: mgalbraith@suse.de Cc: sgurrappadi@nvidia.com Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1469453670-2660-10-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
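A sketch of the capacity test gating affine wake-ups, along the lines of the wake_cap() helper this patch adds to kernel/sched/fair.c (simplified; capacity_margin is assumed to be the fair-class headroom factor, ~1280/1024, and rd->max_cpu_capacity comes from the root-domain patch in this series):

  /*
   * Disable WAKE_AFFINE if the smaller of the waker CPU's and previous
   * CPU's capacities cannot fit the task's utilization with some margin;
   * wakeup balancing (find_idlest_group/cpu) then takes over.
   */
  static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
  {
  	long min_cap, max_cap;

  	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
  	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;

  	/* Capacities are (nearly) symmetric: affine wakeup is fine. */
  	if (max_cap - min_cap < max_cap >> 3)
  		return 0;

  	return min_cap * 1024 < task_util(p) * capacity_margin;
  }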
2016-08-18  sched/core: Store maximum per-CPU capacity in root domain  (Dietmar Eggemann)
To be able to compare the capacity of the target CPU with the highest available CPU capacity, store the maximum per-CPU capacity in the root domain. The max per-CPU capacity should be 1024 for all systems except SMT, where the capacity is currently based on smt_gain and the number of hardware threads and is <1024. If SMT can be brought to work with a per-thread capacity of 1024, this patch can be dropped and replaced by a hard-coded max capacity of 1024 (=SCHED_CAPACITY_SCALE). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: freedom.tan@mediatek.com Cc: keita.kobayashi.ym@renesas.com Cc: mgalbraith@suse.de Cc: sgurrappadi@nvidia.com Cc: vincent.guittot@linaro.org Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/26c69258-9947-f830-a53e-0c54e7750646@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems  (Morten Rasmussen)
A domain with the SD_ASYM_CPUCAPACITY flag set indicates that the sched_groups at this level and below do not include CPUs of all capacities available (e.g. a group containing little-only or big-only CPUs in big.LITTLE systems). It is therefore necessary to put more effort into finding an appropriate CPU at task wake-up by enabling balancing at wake-up (SD_BALANCE_WAKE) on all lower (child) levels.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: freedom.tan@mediatek.com Cc: keita.kobayashi.ym@renesas.com Cc: mgalbraith@suse.de Cc: sgurrappadi@nvidia.com Cc: vincent.guittot@linaro.org Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1469453670-2660-8-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  sched/core: Pass child domain into sd_init()  (Morten Rasmussen)
If behavioural sched_domain flags depend on topology flags set at higher domain levels we need a way to update the child domain flags. Moving the child pointer assignment inside sd_init() should make that possible. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: freedom.tan@mediatek.com Cc: keita.kobayashi.ym@renesas.com Cc: mgalbraith@suse.de Cc: sgurrappadi@nvidia.com Cc: vincent.guittot@linaro.org Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1469453670-2660-7-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  sched/core: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag  (Morten Rasmussen)
Add a topology flag to the sched_domain hierarchy indicating the lowest domain level where the full range of CPU capacities is represented by the domain members for asymmetric capacity topologies (e.g. ARM big.LITTLE).

The flag is intended to indicate that extra care should be taken when placing tasks on CPUs, and that this level spans all the different types of CPUs found in the system (no need to look further up the domain hierarchy). This information is currently only available by iterating through the capacities of all the CPUs at parent levels in the sched_domain hierarchy.

  SD 2      [  0      1      2      3]  SD_ASYM_CPUCAPACITY

  SD 1      [  0      1] [  2      3]  !SD_ASYM_CPUCAPACITY

  CPU:         0      1      2      3
  capacity:  756    756   1024   1024

If the topology in the example above is duplicated to create an eight-CPU example with a third sched_domain level on top (SD 3), this level should not have the flag set (!SD_ASYM_CPUCAPACITY) as its two groups would both have all CPU capacities represented within them.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: freedom.tan@mediatek.com Cc: keita.kobayashi.ym@renesas.com Cc: mgalbraith@suse.de Cc: sgurrappadi@nvidia.com Cc: vincent.guittot@linaro.org Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1469453670-2660-6-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  sched/core: Remove unnecessary NULL-pointer check  (Morten Rasmussen)
Checking if the sched_domain pointer returned by sd_init() is NULL seems pointless, as sd_init() neither checks if it is valid to begin with nor sets it to NULL.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: freedom.tan@mediatek.com Cc: keita.kobayashi.ym@renesas.com Cc: mgalbraith@suse.de Cc: sgurrappadi@nvidia.com Cc: vincent.guittot@linaro.org Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1469453670-2660-5-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  sched/core: Clarify SD_flags comment  (Peter Zijlstra)
The SD_flags comment is very terse and doesn't explain why PACKING is odd. IIRC the distinction is that the 'normal' ones only describe topology, while the ASYM_PACKING one also prescribes behaviour. It is odd in the way that it doesn't only describe things. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: freedom.tan@mediatek.com Cc: keita.kobayashi.ym@renesas.com Cc: mgalbraith@suse.de Cc: sgurrappadi@nvidia.com Cc: vincent.guittot@linaro.org Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/20160815105459.GS6879@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  Merge branch 'sched/urgent' into sched/core, to pick up dependencies  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  sched/cputime: Resync steal time when guest & host lose sync  (Wanpeng Li)
Commit:

  57430218317e ("sched/cputime: Count actually elapsed irq & softirq time")

... fixed a bug but also triggered a regression:

On an i5 laptop, 4 pCPUs, 4 vCPUs for one full-dynticks guest, there are four CPU-hog processes (for loops) running in the guest. I hot-unplug the pCPUs on the host one by one until there is only one left, then observe CPU utilization via 'top' in the guest; it shows:

  100% st for cpu0 (housekeeping)
   75% st for the other CPUs (nohz full mode)

However, w/o this commit it shows the correct 75% for all four CPUs.

When a guest is interrupted for a longer amount of time, missed clock ticks are not redelivered later. Because of that, we should not limit the amount of steal time accounted to the amount of time that the calling functions think has passed.

However, the interval returned by account_other_time() is NOT rounded down to the nearest jiffy, while the base interval in get_vtime_delta() that it is subtracted from is, so the max cputime limit is required to avoid underflow.

This patch fixes the regression by limiting the account_other_time() from get_vtime_delta() to avoid underflow, and lets the other three call sites (in account_other_time() and steal_account_process_time()) account however much steal time the host told us elapsed.

Suggested-by: Rik van Riel <riel@redhat.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kvm@vger.kernel.org Link: http://lkml.kernel.org/r/1471399546-4069-1-git-send-email-wanpeng.li@hotmail.com [ Improved the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
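The bounded path ends up looking roughly like this sketch of get_vtime_delta() (simplified from kernel/sched/cputime.c; only this vtime path passes a bound, since its base interval is rounded down to jiffies, while tick-based callers accept however much steal time the host reports):

  static cputime_t get_vtime_delta(struct task_struct *tsk)
  {
  	cputime_t delta, other;

  	/* Base interval: rounded down to whole jiffies. */
  	delta = jiffies_to_cputime(jiffies - tsk->vtime_snap);
  	/* Bound steal/irq time by delta so 'delta - other' cannot underflow. */
  	other = account_other_time(delta);
  	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
  	tsk->vtime_snap = jiffies;

  	return delta - other;
  }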
2016-08-18  sched: Remove struct rq::nohz_stamp  (Rik van Riel)
The nohz_stamp member of struct rq has been unused since 2010, when this commit removed the code that referenced it: 396e894d289d ("sched: Revert nohz_ratelimit() for now") Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160815121410.5ea1c98f@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  perf/core: Introduce PMU_EV_CAP_READ_ACTIVE_PKG  (David Carrillo-Cisneros)
Introduce the flag PMU_EV_CAP_READ_ACTIVE_PKG, useful for uncore events, that allows a PMU to signal the generic perf code that an event is readable on the current CPU if the event is active on a CPU in the same package as the current CPU.

This is an optimization that avoids an unnecessary IPI for the common case where uncore events are run and read in the same package but on different CPUs.

As an example, the IPI removal speeds up perf_read() on my Haswell system as follows:

  - For event UNC_C_LLC_LOOKUP: from 260 us to 31 us.
  - For event RAPL's power/energy-cores/: from 255 us to 27 us.

For the optimization to work, all events in the group must have it (similarly to PERF_EV_CAP_SOFTWARE).

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1471467307-61171-4-git-send-email-davidcc@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
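A sketch of the package check this enables in the read path (simplified; the helper name is illustrative, and the cap flag follows the PERF_EV_CAP_* spelling used for group_caps):

  /*
   * Pick the CPU to read the event from: if the event advertises
   * PERF_EV_CAP_READ_ACTIVE_PKG and is active on a CPU in the same
   * package as the current CPU, read it locally and skip the IPI.
   */
  static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
  {
  	u16 local_pkg, event_pkg;

  	if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) {
  		event_pkg = topology_physical_package_id(event_cpu);
  		local_pkg = topology_physical_package_id(smp_processor_id());

  		if (event_pkg == local_pkg)
  			return smp_processor_id();
  	}

  	return event_cpu;
  }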
2016-08-18  sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression  (Peter Zijlstra)
Mike reports:

  Roughly 10% of the time, ltp testcase getrusage04 fails:
  getrusage04    0  TINFO  :  Expected timers granularity is 4000 us
  getrusage04    0  TINFO  :  Using 1 as multiply factor for max [us]time increment (1000+4000us)!
  getrusage04    0  TINFO  :  utime:           0us; stime:         179us
  getrusage04    0  TINFO  :  utime:        3751us; stime:           0us
  getrusage04    1  TFAIL  :  getrusage04.c:133: stime increased > 5000us:

And tracked it down to the case where the task simply doesn't get _any_ [us]time ticks.

Update the code to assume all rtime is utime when we lack information, thus ensuring a task that elides the tick gets time accounted.

Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Fredrik Markstrom <fredrik.markstrom@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim <rkrcmar@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: stable@vger.kernel.org # 4.3+ Fixes: 9d7fb0427648 ("sched/cputime: Guarantee stime + utime == rtime") Signed-off-by: Ingo Molnar <mingo@kernel.org>
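A minimal standalone illustration of the adjustment rule (not the kernel's cputime_adjust() itself; names and the scaling step are simplified for clarity):

  #include <linux/math64.h>

  /*
   * Split observed runtime (rtime) into utime/stime based on tick
   * counters. With NO_HZ_FULL a task may accrue rtime without ever
   * taking a tick, leaving both counters at zero; attribute all rtime
   * to utime in that case so time still advances monotonically.
   */
  static void adjust_cputime(u64 rtime, u64 tick_utime, u64 tick_stime,
  			   u64 *utime, u64 *stime)
  {
  	if (tick_stime == 0) {
  		/* No system ticks observed (or no ticks at all). */
  		*utime = rtime;
  		*stime = 0;
  		return;
  	}
  	if (tick_utime == 0) {
  		/* Only system ticks observed. */
  		*stime = rtime;
  		*utime = 0;
  		return;
  	}
  	/* Otherwise scale rtime by the observed tick ratio. */
  	*stime = div64_u64(rtime * tick_stime, tick_stime + tick_utime);
  	*utime = rtime - *stime;
  }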
2016-08-18  perf/core: Generalize event->group_flags  (David Carrillo-Cisneros)
Currently, PERF_GROUP_SOFTWARE is used in the group_flags field of a group's leader to indicate that is_software_event(event) is true for all events in a group. This is the only usage of event->group_flags.

This pattern of setting group-level flags when all events in the group share a property is useful for the flag introduced in the next patch and for future CQM/CMT flags. So this patch generalizes group_flags to work as an aggregate of event-level flags.

PERF_GROUP_SOFTWARE denotes an immutable event property. All other flags that I intend to add are also determinable at event initialization. To better convey the above, this patch renames the event's group_flags to group_caps and PERF_GROUP_SOFTWARE to PERF_EV_CAP_SOFTWARE.

Individual event flags are stored in the new event->event_caps. Since the cap flags do not change after event initialization, there is no need to serialize event_caps. This new field is used when events are added to a context, similarly to how PERF_GROUP_SOFTWARE and is_software_event() worked.

Lastly, for consistency, update is_software_event() to rely on event_caps instead of the context index.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1471467307-61171-3-git-send-email-davidcc@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  bitmap.h, perf/core: Fix the mask in perf_output_sample_regs()  (Madhavan Srinivasan)
When decoding the perf_regs mask in perf_output_sample_regs(), we loop through the mask using the find_first_bit() and find_next_bit() functions. While the existing code works fine in most cases, the logic is broken for big-endian 32-bit kernels.

When reading a u64 mask using (u32 *)(&val)[0], find_*_bit() assumes that it gets the lower 32 bits of the u64, but instead it gets the upper 32 bits - which is wrong. The fix is to swap the words of the u64 to handle this case. This is _not_ a regular endianness swap.

Suggested-by: Yury Norov <ynorov@caviumnetworks.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yury Norov <ynorov@caviumnetworks.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1471426568-31051-2-git-send-email-maddy@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
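A sketch of the helper and its use, per the shape of the fix (simplified; the helper converts the u64 mask into a bitmap word by word instead of type-punning through (u32 *)&mask):

  /* Build a bitmap from a u64 so find_*_bit() sees the right words on
   * both 32-bit and 64-bit, big- or little-endian. */
  static inline void bitmap_from_u64(unsigned long *dst, u64 mask)
  {
  	dst[0] = mask & ULONG_MAX;

  	if (sizeof(mask) > sizeof(unsigned long))
  		dst[1] = mask >> 32;
  }

  static void perf_output_sample_regs(struct perf_output_handle *handle,
  				    struct pt_regs *regs, u64 mask)
  {
  	int bit;
  	DECLARE_BITMAP(_mask, 64);

  	bitmap_from_u64(_mask, mask);
  	for_each_set_bit(bit, _mask, sizeof(mask) * BITS_PER_BYTE) {
  		u64 val = perf_reg_value(regs, bit);

  		perf_output_put(handle, val);
  	}
  }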
2016-08-18  Merge branch 'perf/urgent' into perf/core, to pick up dependencies  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  perf/core: Check return value of the perf_event_read() IPI  (David Carrillo-Cisneros)
The call to smp_call_function_single() in perf_event_read() may fail if an invalid or offline CPU index is passed. Warn the user if such a bug is present and return an error.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1471467307-61171-2-git-send-email-davidcc@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
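A sketch of the checked read (the wrapper name is hypothetical; struct perf_read_data and __perf_event_read() follow kernel/events/core.c):

  static int perf_event_read_on_cpu(struct perf_event *event, int cpu)
  {
  	struct perf_read_data data = {
  		.event = event,
  		.group = false,
  		.ret = 0,
  	};
  	int ret;

  	ret = smp_call_function_single(cpu, __perf_event_read, &data, 1);
  	/*
  	 * The IPI returns -ENXIO for an invalid or offline CPU: warn and
  	 * propagate the error instead of silently returning stale counts.
  	 */
  	WARN_ON_ONCE(ret);
  	return ret ? : data.ret;
  }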
2016-08-18  perf/core: Enable mapping of the stop filters  (Mathieu Poirier)
At this time the perf_addr_filter_needs_mmap() function will _not_ return true on a user space 'stop' filter. But stop filters need exactly the same kind of mapping that range and start filters get. Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1468860187-318-4-git-send-email-mathieu.poirier@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  perf/core: Update filters only on executable mmap  (Mathieu Poirier)
Function perf_event_mmap() is called by the MM subsystem each time part of a binary is loaded in memory. There can be several mappings for a binary, many of them unrelated to the code section. Each time a section of a binary is mapped, address filters are updated, even when the map doesn't pertain to the code section. The end result is that filters are configured based on the last map event that was received rather than the last mapping of the code segment.

For example, if we have an executable 'main' that calls library 'libcstest.so.1.0', and we want to collect traces on code that is in that library, the perf cmd line for this scenario would be:

  perf record -e cs_etm// --filter 'filter 0x72c/0x40@/opt/lib/libcstest.so.1.0' --per-thread ./main

Resulting in binaries being mapped this way:

  root@linaro-nano:~# cat /proc/1950/maps
  00400000-00401000 r-xp 00000000 08:02 33169      /home/linaro/main
  00410000-00411000 r--p 00000000 08:02 33169      /home/linaro/main
  00411000-00412000 rw-p 00001000 08:02 33169      /home/linaro/main
  7fa2464000-7fa2474000 rw-p 00000000 00:00 0
  7fa2474000-7fa25a4000 r-xp 00000000 08:02 543    /lib/aarch64-linux-gnu/libc-2.21.so
  7fa25a4000-7fa25b3000 ---p 00130000 08:02 543    /lib/aarch64-linux-gnu/libc-2.21.so
  7fa25b3000-7fa25b7000 r--p 0012f000 08:02 543    /lib/aarch64-linux-gnu/libc-2.21.so
  7fa25b7000-7fa25b9000 rw-p 00133000 08:02 543    /lib/aarch64-linux-gnu/libc-2.21.so
  7fa25b9000-7fa25bd000 rw-p 00000000 00:00 0
  7fa25bd000-7fa25be000 r-xp 00000000 08:02 38308  /opt/lib/libcstest.so.1.0
  7fa25be000-7fa25cd000 ---p 00001000 08:02 38308  /opt/lib/libcstest.so.1.0
  7fa25cd000-7fa25ce000 r--p 00000000 08:02 38308  /opt/lib/libcstest.so.1.0
  7fa25ce000-7fa25cf000 rw-p 00001000 08:02 38308  /opt/lib/libcstest.so.1.0
  7fa25cf000-7fa25eb000 r-xp 00000000 08:02 574    /lib/aarch64-linux-gnu/ld-2.21.so
  7fa25ef000-7fa25f2000 rw-p 00000000 00:00 0
  7fa25f7000-7fa25f9000 rw-p 00000000 00:00 0
  7fa25f9000-7fa25fa000 r--p 00000000 00:00 0      [vvar]
  7fa25fa000-7fa25fb000 r-xp 00000000 00:00 0      [vdso]
  7fa25fb000-7fa25fc000 r--p 0001c000 08:02 574    /lib/aarch64-linux-gnu/ld-2.21.so
  7fa25fc000-7fa25fe000 rw-p 0001d000 08:02 574    /lib/aarch64-linux-gnu/ld-2.21.so
  7ff2ea8000-7ff2ec9000 rw-p 00000000 00:00 0      [stack]
  root@linaro-nano:~#

Before 'main()' can execute, 'libcstest.so.1.0' has to be loaded in memory. Once that has been done, perf_event_mmap() has been called 4 times, with the last map starting at address 0x7fa25ce000 and the address filter configured to start filtering when the IP passes over address 0x7fa25ce72c (0x7fa25ce000 + 0x72c). But that is wrong, since the code segment for library 'libcstest.so.1.0' has been mapped at 0x7fa25bd000, resulting in traces not being collected.

This patch corrects the situation by requesting that address filters be updated only if the mapped event is for a code segment.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1468860187-318-3-git-send-email-mathieu.poirier@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
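The guard boils down to checking VM_EXEC on the incoming mapping before touching any filters; a condensed sketch:

  /*
   * Skip address-filter adjustment for any mapping that is not
   * executable, so data/rodata segments of the same binary cannot
   * clobber a filter configured against the code segment.
   */
  static void perf_addr_filters_adjust(struct vm_area_struct *vma)
  {
  	/* Address filters target code only. */
  	if (!(vma->vm_flags & VM_EXEC))
  		return;

  	/* ... walk the context's filters and update their offsets ... */
  }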
2016-08-18  perf/core: Fix file name handling for start/stop filters  (Mathieu Poirier)
Binary file names have to be supplied for both range and start/stop filters but the current code only processes the filename if an address range filter is specified. This code adds processing of the filename for start/stop filters. Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1468860187-318-2-git-send-email-mathieu.poirier@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  perf/core: Fix event_function_local()  (Peter Zijlstra)
Vincent reported triggering the WARN_ON_ONCE() in event_function_local(). While thinking through cases I noticed that by using event_function() directly, we miss the inactive case usually handled by event_function_call(). Therefore construct a blend of event_function_call() and event_function() that handles the cases relevant to event_function_local(). Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org # 4.5+ Fixes: fae3fde65138 ("perf: Collapse and fix event_function_call() users") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  uprobes: Rename the "struct page *" args of __replace_page()  (Oleg Nesterov)
Purely cosmetic, no changes in the compiled code.

Perhaps it is just me, but I can hardly read __replace_page() because I can't distinguish "page" from "kpage" and because I need to look at the caller to ensure that, say, kpage is really the new page and the code is correct. Rename them to old_page and new_page; this matches the caller.

Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Brenden Blanco <bblanco@plumgrid.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Link: http://lkml.kernel.org/r/20160817153704.GC29724@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  Merge branch 'perf/urgent' into perf/core, to pick up dependency  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  uprobes: Fix the memcg accounting  (Oleg Nesterov)
__replace_page() wrongly calls mem_cgroup_cancel_charge() in the "success" path; it should only do this if page_check_address() fails. This means that every enable/disable leads to an unbalanced mem_cgroup_uncharge() from put_page(old_page); it is trivial to underflow the page_counter->count and trigger OOM.

Reported-and-tested-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: stable@vger.kernel.org # 3.17+ Fixes: 00501b531c47 ("mm: memcontrol: rewrite charge API") Link: http://lkml.kernel.org/r/20160817153629.GB29724@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
Minor overlapping changes for both merge conflicts. Resolution work done by Stephen Rothwell was used as a reference. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  Merge branch 'pm-sleep'  (Rafael J. Wysocki)
* pm-sleep:
  PM / hibernate: Fix rtree_next_node() to avoid walking off list ends
  x86/power/64: Use __pa() for physical address computation
  PM / sleep: Update some system sleep documentation
2016-08-17  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (Linus Torvalds)
Pull networking fixes from David Miller:

 1) Buffers powersave frame test is reversed in cfg80211, fix from Felix Fietkau.

 2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme.

 3) Fix some tg3 ethtool logic bugs, and one that would cause no interrupts to be generated when rx-coalescing is set to 0. From Satish Baddipadige and Siva Reddy Kallam.

 4) QLCNIC mailbox corruption and napi budget handling fix from Manish Chopra.

 5) Fix fib_trie logic when walking the trie during /proc/net/route output that can access a stale node pointer. From David Forster.

 6) Several sctp_diag fixes from Phil Sutter.

 7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel.

 8) Checksum fixup fixes in bpf from Daniel Borkmann.

 9) Memory leaks in nfnetlink, from Liping Zhang.

10) Use after free in rxrpc, from David Howells.

11) Use after free in new skb_array code of macvtap driver, from Jason Wang.

12) Calipso resource leak, from Colin Ian King.

13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang.

14) Fix bpf non-linear packet write helpers, from Daniel Borkmann.

15) Fix lockdep splats in macsec, from Sabrina Dubroca.

16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF handling.

17) Various tc-action bug fixes, from CONG Wang.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
  net_sched: allow flushing tc police actions
  net_sched: unify the init logic for act_police
  net_sched: convert tcf_exts from list to pointer array
  net_sched: move tc offload macros to pkt_cls.h
  net_sched: fix a typo in tc_for_each_action()
  net_sched: remove an unnecessary list_del()
  net_sched: remove the leftover cleanup_a()
  mlxsw: spectrum: Allow packets to be trapped from any PG
  mlxsw: spectrum: Unmap 802.1Q FID before destroying it
  mlxsw: spectrum: Add missing rollbacks in error path
  mlxsw: reg: Fix missing op field fill-up
  mlxsw: spectrum: Trap loop-backed packets
  mlxsw: spectrum: Add missing packet traps
  mlxsw: spectrum: Mark port as active before registering it
  mlxsw: spectrum: Create PVID vPort before registering netdevice
  mlxsw: spectrum: Remove redundant errors from the code
  mlxsw: spectrum: Don't return upon error in removal path
  i40e: check for and deal with non-contiguous TCs
  ixgbe: Re-enable ability to toggle VLAN filtering
  ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths
  ...
2016-08-17  cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork  (Balbir Singh)
cgroup_threadgroup_rwsem is acquired in read mode during process exit and fork. It is also grabbed in write mode during __cgroups_proc_write(). I've recently run into a scenario with lots of memory pressure and OOM and I am beginning to see

  systemd
  __switch_to+0x1f8/0x350
  __schedule+0x30c/0x990
  schedule+0x48/0xc0
  percpu_down_write+0x114/0x170
  __cgroup_procs_write.isra.12+0xb8/0x3c0
  cgroup_file_write+0x74/0x1a0
  kernfs_fop_write+0x188/0x200
  __vfs_write+0x6c/0xe0
  vfs_write+0xc0/0x230
  SyS_write+0x6c/0x110
  system_call+0x38/0xb4

This thread is waiting on the reader of cgroup_threadgroup_rwsem to exit. The reader itself is under memory pressure and has gone into reclaim after fork. There are times the reader also ends up waiting on oom_lock as well.

  __switch_to+0x1f8/0x350
  __schedule+0x30c/0x990
  schedule+0x48/0xc0
  jbd2_log_wait_commit+0xd4/0x180
  ext4_evict_inode+0x88/0x5c0
  evict+0xf8/0x2a0
  dispose_list+0x50/0x80
  prune_icache_sb+0x6c/0x90
  super_cache_scan+0x190/0x210
  shrink_slab.part.15+0x22c/0x4c0
  shrink_zone+0x288/0x3c0
  do_try_to_free_pages+0x1dc/0x590
  try_to_free_pages+0xdc/0x260
  __alloc_pages_nodemask+0x72c/0xc90
  alloc_pages_current+0xb4/0x1a0
  page_table_alloc+0xc0/0x170
  __pte_alloc+0x58/0x1f0
  copy_page_range+0x4ec/0x950
  copy_process.isra.5+0x15a0/0x1870
  _do_fork+0xa8/0x4b0
  ppc_clone+0x8/0xc

In the meanwhile, all processes exiting/forking are blocked, almost stalling the system.

This patch moves the threadgroup_change_begin from before cgroup_fork() to just before cgroup_canfork(). There is no need to worry about threadgroup changes until the task is actually added to the threadgroup. This avoids having to call reclaim with cgroup_threadgroup_rwsem held.

tj: Subject and description edits.

Signed-off-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Zefan Li <lizefan@huawei.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Tejun Heo <tj@kernel.org>
2016-08-17  genirq: Correctly configure the trigger on chained interrupts  (Marc Zyngier)
Commit 1e2a7d78499e ("irqdomain: Don't set type when mapping an IRQ") moved the trigger configuration call from the irqdomain mapping to the interrupt being actually requested. This patch failed to handle the case where we configure a chained interrupt, which doesn't get requested through the usual path. In order to solve this, let's call __irq_set_trigger just before starting the cascade interrupt. Special care must be taken to make the flow handler stick, as the .irq_set_type method could have reset it (it doesn't know we're dealing with a chained interrupt). Based on an initial patch by Jon Hunter. Fixes: 1e2a7d78499e ("irqdomain: Don't set type when mapping an IRQ") Reported-by: John Stultz <john.stultz@linaro.org> Reported-by: Linus Walleij <linus.walleij@linaro.org> Tested-by: John Stultz <john.stultz@linaro.org> Acked-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
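The shape of the change in __irq_do_set_handler() is roughly the following sketch (simplified excerpt; is_chained and handle are that function's parameters, and the settings calls are illustrative of the chained setup path):

  	if (is_chained) {
  		unsigned int type = irqd_get_trigger_type(&desc->irq_data);

  		/*
  		 * We're about to start this interrupt immediately, hence
  		 * the need to set the trigger configuration now. But
  		 * .irq_set_type may have replaced the flow handler, not
  		 * knowing this is a chained interrupt: re-install it.
  		 */
  		if (type != IRQ_TYPE_NONE) {
  			__irq_set_trigger(desc, type);
  			desc->handle_irq = handle;
  		}

  		irq_settings_set_noprobe(desc);
  		irq_settings_set_norequest(desc);
  		irq_settings_set_nothread(desc);
  		irq_startup(desc, true);
  	}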
2016-08-16  cpufreq / sched: Pass runqueue pointer to cpufreq_update_util()  (Rafael J. Wysocki)
All of the callers of cpufreq_update_util() pass rq_clock(rq) to it as the time argument and some of them check whether or not cpu_of(rq) is equal to smp_processor_id() before calling it, so rework it to take a runqueue pointer as the argument and move the rq_clock(rq) evaluation into it. Additionally, provide a wrapper checking cpu_of(rq) against smp_processor_id() for the cpufreq_update_util() callers that need it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
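After the rework, the helpers in kernel/sched/sched.h look roughly like this sketch (together with the flags argument from the companion patch below):

  static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
  {
  	struct update_util_data *data;

  	data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
  	if (data)
  		/* rq_clock(rq) evaluated here instead of at every caller */
  		data->func(data, rq_clock(rq), flags);
  }

  /* Wrapper for callers that must only trigger updates for the local CPU. */
  static inline void cpufreq_update_this_cpu(struct rq *rq, unsigned int flags)
  {
  	if (cpu_of(rq) == smp_processor_id())
  		cpufreq_update_util(rq, flags);
  }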
2016-08-16  cpufreq / sched: Pass flags to cpufreq_update_util()  (Rafael J. Wysocki)
It is useful to know the reason why cpufreq_update_util() has just been called, and that can be passed as flags to cpufreq_update_util() and to the ->func() callback in struct update_util_data. However, doing that in addition to passing the util and max arguments they already take would be clumsy, so avoid it.

Instead, use the observation that the schedutil governor is part of the scheduler proper, so it can access scheduler data directly. This allows the util and max arguments of cpufreq_update_util() and the ->func() callback in struct update_util_data to be replaced with a flags one, but schedutil has to be modified to follow.

Thus make the schedutil governor obtain the CFS utilization information from the scheduler and use the "RT" and "DL" flags instead of the special utilization value of ULONG_MAX to track updates from the RT and DL sched classes. Make it non-modular too to avoid having to export scheduler variables to modules at large.

Next, update all of the other users of cpufreq_update_util() and the ->func() callback in struct update_util_data accordingly.

Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-08-16  block: Fix secure erase  (Adrian Hunter)
Commit 288dab8a35a0 ("block: add a separate operation type for secure erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering all the places REQ_OP_DISCARD was being used to mean either. Fix those. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase") Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-16  PM / hibernate: Fix rtree_next_node() to avoid walking off list ends  (James Morse)
rtree_next_node() walks the linked list of leaf nodes to find the next block of pages in the struct memory_bitmap. If it walks off the end of the list of nodes, it walks the list of memory zones to find the next region of memory. If it walks off the end of the list of zones, it returns false.

This leaves the struct bm_position's node and zone pointers pointing at their respective struct list_heads in struct mem_zone_bm_rtree.

memory_bm_find_bit() uses struct bm_position's node and zone pointers to avoid walking lists and trees if the next bit appears in the same node/zone. It handles these values being stale.

Swap rtree_next_node()'s 'step then test' to 'test-next then step'; this means if we reach the end of memory we return false and leave the node and zone pointers as they were.

This fixes a panic on resume using AMD Seattle with 64K pages:

  [    6.868732] Freezing user space processes ... (elapsed 0.000 seconds) done.
  [    6.875753] Double checking all user space processes after OOM killer disable... (elapsed 0.000 seconds)
  [    6.896453] PM: Using 3 thread(s) for decompression.
  [    6.896453] PM: Loading and decompressing image data (5339 pages)...
  [    7.318890] PM: Image loading progress:   0%
  [    7.323395] Unable to handle kernel paging request at virtual address 00800040
  [    7.330611] pgd = ffff000008df0000
  [    7.334003] [00800040] *pgd=00000083fffe0003, *pud=00000083fffe0003, *pmd=00000083fffd0003, *pte=0000000000000000
  [    7.344266] Internal error: Oops: 96000005 [#1] PREEMPT SMP
  [    7.349825] Modules linked in:
  [    7.352871] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G        W I     4.8.0-rc1 #4737
  [    7.360512] Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD1002C 04/08/2016
  [    7.369109] task: ffff8003c0220000 task.stack: ffff8003c0280000
  [    7.375020] PC is at set_bit+0x18/0x30
  [    7.378758] LR is at memory_bm_set_bit+0x24/0x30
  [    7.383362] pc : [<ffff00000835bbc8>] lr : [<ffff0000080faf18>] pstate: 60000045
  [    7.390743] sp : ffff8003c0283b00
  [    7.473551]
  [    7.475031] Process swapper/0 (pid: 1, stack limit = 0xffff8003c0280020)
  [    7.481718] Stack: (0xffff8003c0283b00 to 0xffff8003c0284000)
  [    7.800075] Call trace:
  [    7.887097] [<ffff00000835bbc8>] set_bit+0x18/0x30
  [    7.891876] [<ffff0000080fb038>] duplicate_memory_bitmap.constprop.38+0x54/0x70
  [    7.899172] [<ffff0000080fcc40>] snapshot_write_next+0x22c/0x47c
  [    7.905166] [<ffff0000080fe1b4>] load_image_lzo+0x754/0xa88
  [    7.910725] [<ffff0000080ff0a8>] swsusp_read+0x144/0x230
  [    7.916025] [<ffff0000080fa338>] load_image_and_restore+0x58/0x90
  [    7.922105] [<ffff0000080fa660>] software_resume+0x2f0/0x338
  [    7.927752] [<ffff000008083350>] do_one_initcall+0x38/0x11c
  [    7.933314] [<ffff000008b40cc0>] kernel_init_freeable+0x14c/0x1ec
  [    7.939395] [<ffff0000087ce564>] kernel_init+0x10/0xfc
  [    7.944520] [<ffff000008082e90>] ret_from_fork+0x10/0x40
  [    7.949820] Code: d2800022 8b400c21 f9800031 9ac32043 (c85f7c22)
  [    7.955909] ---[ end trace 0024a5986e6ff323 ]---
  [    7.960529] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

Here struct mem_zone_bm_rtree's start_pfn has been returned instead of struct rtree_node's addr, as the node/zone pointers are corrupt after we walked off the end of the lists during mark_unsafe_pages().

This behaviour was exposed by commit 6dbecfd345a6 ("PM / hibernate: Simplify mark_unsafe_pages()"), which caused mark_unsafe_pages() to call duplicate_memory_bitmap(), which uses memory_bm_find_bit() after walking off the end of the memory bitmap.

Fixes: 3a20cb177961 (PM / Hibernate: Implement position keeping in radix tree)
Signed-off-by: James Morse <james.morse@arm.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
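The 'test-next then step' form described above looks roughly like this sketch of the fixed function (field names per kernel/power/snapshot.c):

  static bool rtree_next_node(struct memory_bitmap *bm)
  {
  	/* Test before stepping: is there another leaf node in this zone? */
  	if (!list_is_last(&bm->cur.node->list, &bm->cur.zone->leaves)) {
  		bm->cur.node = list_entry(bm->cur.node->list.next,
  					  struct rtree_node, list);
  		bm->cur.node_pfn += BM_BITS_PER_BLOCK;
  		bm->cur.node_bit = 0;
  		return true;
  	}

  	/* No more nodes: is there another zone? */
  	if (!list_is_last(&bm->cur.zone->list, &bm->zones)) {
  		bm->cur.zone = list_entry(bm->cur.zone->list.next,
  					  struct mem_zone_bm_rtree, list);
  		bm->cur.node = list_entry(bm->cur.zone->leaves.next,
  					  struct rtree_node, list);
  		bm->cur.node_pfn = 0;
  		bm->cur.node_bit = 0;
  		return true;
  	}

  	/* End of memory: node and zone pointers are left untouched. */
  	return false;
  }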
2016-08-12  bpf: allow bpf_get_prandom_u32() to be used in tracing  (Alexei Starovoitov)
bpf_get_prandom_u32() was initially introduced for socket filters and later requested numerous times to be added to tracing bpf programs for the same reason as in socket filters: to be able to randomly select incoming events.

Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-12  bpf: allow helpers access the packet directly  (Alexei Starovoitov)
The helper functions like bpf_map_lookup_elem(map, key) were only allowing 'key' to point to the initialized stack area. That is causing performance degradation when programs need to process millions of packets per second and need to copy contents of the packet into the stack just to pass the stack pointer into the lookup() function. Allow such helpers to read from the packet directly.

All helpers that expect ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE, ARG_PTR_TO_STACK assume a byte-aligned pointer, so there are no alignment concerns; we only need to check that the helper will not be accessing beyond the packet range verified by the prior 'if (ptr < data_end)' condition.

For now allow this feature for XDP programs only. Later it can be relaxed for the clsact programs as well.

Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
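A hypothetical XDP program using the feature (my_map, the 8-byte key layout, and the section names are made up for illustration; bpf_helpers.h is the helper header shipped with the kernel samples):

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  struct bpf_map_def SEC("maps") my_map = {
  	.type        = BPF_MAP_TYPE_HASH,
  	.key_size    = 8,
  	.value_size  = sizeof(long),
  	.max_entries = 1024,
  };

  SEC("xdp")
  int count_by_header(struct xdp_md *ctx)
  {
  	void *data     = (void *)(long)ctx->data;
  	void *data_end = (void *)(long)ctx->data_end;
  	long *value;

  	/* Verifier-visible bounds check, required before using 'data'. */
  	if (data + 8 > data_end)
  		return XDP_PASS;

  	/* Key taken directly from packet memory: no copy to the stack. */
  	value = bpf_map_lookup_elem(&my_map, data);
  	if (value)
  		__sync_fetch_and_add(value, 1);

  	return XDP_PASS;
  }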
2016-08-12  bpf: fix bpf_skb_in_cgroup helper naming  (Daniel Borkmann)
While hashing out BPF's current_task_under_cgroup helper bits, it came to discussion that the skb_in_cgroup helper name was suboptimally chosen.

Tejun says:

  So, I think in_cgroup should mean that the object is in that
  particular cgroup while under_cgroup in the subhierarchy of that
  cgroup. Let's rename the other subhierarchy test to under too. I
  think that'd be a lot less confusing going forward.

  [...]

  It's more intuitive and gives us the room to implement the real
  "in" test if ever necessary in the future.

Since this touches uapi bits, we need to change this as long as v4.8 is not yet officially released. Thus, change the helper enum and rename related bits.

Fixes: 4a482f34afcc ("cgroup: bpf: Add bpf_skb_in_cgroup_proto") Reference: http://patchwork.ozlabs.org/patch/658500/ Suggested-by: Sargun Dhillon <sargun@sargun.me> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>
2016-08-12  bpf: Add bpf_current_task_under_cgroup helper  (Sargun Dhillon)
This adds a bpf helper that's similar to the skb_in_cgroup helper to check whether the probe is currently executing in the context of a specific subset of the cgroupsv2 hierarchy. It does this based on membership test for a cgroup arraymap. It is invalid to call this in an interrupt, and it'll return an error. The helper is primarily to be used in debugging activities for containers, where you may have multiple programs running in a given top-level "container". Signed-off-by: Sargun Dhillon <sargun@sargun.me> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
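A hypothetical tracing program gated on cgroup membership (the map name, section names, and probed syscall are illustrative; the cgroup arraymap is populated from user space with a cgroup-v2 directory fd at index 0):

  #include <linux/bpf.h>
  #include <linux/types.h>
  #include "bpf_helpers.h"

  struct bpf_map_def SEC("maps") cgroup_map = {
  	.type        = BPF_MAP_TYPE_CGROUP_ARRAY,
  	.key_size    = sizeof(__u32),
  	.value_size  = sizeof(__u32),
  	.max_entries = 1,
  };

  SEC("kprobe/sys_write")
  int trace_write(struct pt_regs *ctx)
  {
  	/*
  	 * Returns 1 if current runs under the cgroup hierarchy at map
  	 * index 0, 0 if not, and a negative error in interrupt context.
  	 */
  	if (bpf_current_task_under_cgroup(&cgroup_map, 0) != 1)
  		return 0;

  	/* ... record the event only for tasks inside the container ... */
  	return 0;
  }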
2016-08-12  Merge tag 'pm-4.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm  (Linus Torvalds)

Pull power management fixes from Rafael Wysocki:
 "Two hibernation fixes allowing it to work with the recently added
  randomization of the kernel identity mapping base on x86-64 and one
  cpufreq driver regression fix.

  Specifics:

   - Fix the x86 identity mapping creation helpers to avoid the
     assumption that the base address of the mapping will always be
     aligned at the PGD level, as it may be aligned at the PUD level if
     address space randomization is enabled (Rafael Wysocki).

   - Fix the hibernation core to avoid executing tracing functions
     before restoring the processor state completely during resume
     (Thomas Garnier).

   - Fix a recently introduced regression in the powernv cpufreq driver
     that causes it to crash due to an out-of-bounds array access
     (Akshay Adiga)"

* tag 'pm-4.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / hibernate: Restore processor state before using per-CPU variables
  x86/power/64: Always create temporary identity mapping correctly
  cpufreq: powernv: Fix crash in gpstate_timer_handler()
2016-08-12  Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull timer fixes from Ingo Molnar:
 "Misc fixes: a /dev/rtc regression fix, two APIC timer period
  calibration fixes, an ARM clocksource driver fix and a NOHZ power use
  regression fix"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/hpet: Fix /dev/rtc breakage caused by RTC cleanup
  x86/timers/apic: Inform TSC deadline clockevent device about recalibration
  x86/timers/apic: Fix imprecise timer interrupts by eliminating TSC clockevents frequency roundoff error
  timers: Fix get_next_timer_interrupt() computation
  clocksource/arm_arch_timer: Force per-CPU interrupt to be level-triggered
2016-08-12  Merge branches 'pm-sleep' and 'pm-cpufreq'  (Rafael J. Wysocki)
* pm-sleep:
  PM / hibernate: Restore processor state before using per-CPU variables
  x86/power/64: Always create temporary identity mapping correctly

* pm-cpufreq:
  cpufreq: powernv: Fix crash in gpstate_timer_handler()
2016-08-12  Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull scheduler fixes from Ingo Molnar:
 "Misc fixes: cputime fixes, two deadline scheduler fixes and a cgroups
  scheduling fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cputime: Fix omitted ticks passed in parameter
  sched/cputime: Fix steal time accounting
  sched/deadline: Fix lock pinning warning during CPU hotplug
  sched/cputime: Mitigate performance regression in times()/clock_gettime()
  sched/fair: Fix typo in sync_throttle()
  sched/deadline: Fix wrap-around in DL heap
2016-08-12  PM / hibernate: Restore processor state before using per-CPU variables  (Thomas Garnier)
Restore the processor state before calling any other functions to ensure per-CPU variables can be used with KASLR memory randomization. Tracing functions use per-CPU variables (GS based on x86) and one was called just before restoring the processor state fully. It resulted in a double fault when both the tracing & the exception handler functions tried to use a per-CPU variable. Fixes: bb3632c6101b (PM / sleep: trace events for suspend/resume) Reported-and-tested-by: Borislav Petkov <bp@suse.de> Reported-by: Jiri Kosina <jikos@kernel.org> Tested-by: Rafael J. Wysocki <rafael@kernel.org> Tested-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Garnier <thgarnie@google.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-08-12  Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull perf fixes from Ingo Molnar:
 "Mostly tooling fixes, plus two uncore-PMU fixes, an uprobes fix, a
  perf-cgroups fix and an AUX events fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Add enable_box for client MSR uncore
  perf/x86/intel/uncore: Fix uncore num_counters
  uprobes/x86: Fix RIP-relative handling of EVEX-encoded instructions
  perf/core: Set cgroup in CPU contexts for new cgroup events
  perf/core: Fix sideband list-iteration vs. event ordering NULL pointer deference crash
  perf probe ppc64le: Fix probe location when using DWARF
  perf probe: Add function to post process kernel trace events
  tools: Sync cpufeatures headers with the kernel
  toops: Sync tools/include/uapi/linux/bpf.h with the kernel
  tools: Sync cpufeatures.h and vmx.h with the kernel
  perf probe: Support signedness casting
  perf stat: Avoid skew when reading events
  perf probe: Fix module name matching
  perf probe: Adjust map->reloc offset when finding kernel symbol from map
  perf hists: Trim libtraceevent trace_seq buffers
  perf script: Add 'bpf-output' field to usage message
2016-08-12  Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull locking fixes from Ingo Molnar:
 "Misc fixes: lockstat fix, futex fix on !MMU systems, big endian fix
  for qrwlocks and a race fix for pvqspinlocks"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/pvqspinlock: Fix a bug in qstat_read()
  locking/pvqspinlock: Fix double hash race
  locking/qrwlock: Fix write unlock bug on big endian systems
  futex: Assume all mappings are private on !MMU systems
2016-08-12  Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull irq fix from Ingo Molnar:
 "A fix for an MSI regression"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/msi: Make sure PCI MSIs are activated early