path: root/kernel
Age  Commit message  Author
2011-09-08  nohz: Fix update_ts_time_stat idle accounting  (Michal Hocko)
update_ts_time_stat currently updates idle time even if we are in an iowait loop at the moment. The only real users of the idle counter (via get_cpu_idle_time_us) are CPU governors, and they expect to get cumulative time for both idle and iowait. The value (idle_sleeptime) is also printed to userspace by print_cpu, but it prints both idle and iowait times, so the idle part is misleading. Let's clean this up and fix update_ts_time_stat to account both counters properly, and update consumers of idle to consider iowait time as well. If we do this, we can use get_cpu_{idle,iowait}_time_us from other contexts as well and get the expected values.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Jones <davej@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Link: http://lkml.kernel.org/r/e9c909c221a8da402c4da07e4cd968c3218f8eb1.1314172057.git.mhocko@suse.cz
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
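A minimal kernel-style sketch (editor's illustration, not part of the commit) of how a governor-type consumer could combine the two counters after this change; get_cpu_idle_time_us() and get_cpu_iowait_time_us() are the helpers referred to above, while the wrapper itself is hypothetical:

  /* Hypothetical wrapper: "idle" as governors expect it is the sum of
   * pure idle time and iowait time once both are accounted properly. */
  static u64 cpu_total_idle_us(int cpu)
  {
          u64 idle, iowait, wall;

          idle   = get_cpu_idle_time_us(cpu, &wall);   /* pure idle */
          iowait = get_cpu_iowait_time_us(cpu, &wall); /* idle while waiting on I/O */

          return idle + iowait;
  }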
2011-09-07  Merge branch 'timers-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip  (Linus Torvalds)
* 'timers-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  rtc: twl: Fix registration vs. init order
  rtc: Initialized rtc_time->tm_isdst
  rtc: Fix RTC PIE frequency limit
  rtc: rtc-twl: Remove lockdep related local_irq_enable()
  rtc: rtc-twl: Switch to using threaded irq
  rtc: ep93xx: Fix 'rtc' may be used uninitialized warning
  alarmtimers: Avoid possible denial of service with high freq periodic timers
  alarmtimers: Memset itimerspec passed into alarm_timer_get
  alarmtimers: Avoid possible null pointer traversal
2011-09-07  Merge branch 'sched-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip  (Linus Torvalds)
* 'sched-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  sched: Fix a memory leak in __sdt_free()
  sched: Move blk_schedule_flush_plug() out of __schedule()
  sched: Separate the scheduler entry for preemption
2011-08-31  perf_event: Fix broken calc_timer_values()  (Eric B Munson)
We detected a serious issue with PERF_SAMPLE_READ and timing information when events were being multiplexed. Samples would have time_running > time_enabled. That was easy to reproduce with a libpfm4 example (ran 3 times to cause multiplexing on Core 2):

  $ syst_smpl -e uops_retired:freq=1 &
  $ syst_smpl -e uops_retired:freq=1 &
  $ syst_smpl -e uops_retired:freq=1 &
  IIP:0x0000000040062d ... PERIOD:2355332948 ENA=40144625315 RUN=60014875184
  syst_smpl: WARNING: time_running > time_enabled
  63277537998 uops_retired:freq=1 , scaled

The bug was not present in kernels up to (and including) 3.0. It turns out the bug was introduced by the following commit:

  commit c4794295917ebeda8013b6cb9c8d71ab4f74a1fa
  events: Move lockless timer calculation into helper function

The parameters of the function got reversed, yet the call sites were not updated to reflect the change. That led to time_running and time_enabled being swapped. This had no effect when there was no multiplexing, because in that case time_running = time_enabled, but it would show up in any other scenario.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110829124112.GA4828@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
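To make the failure mode concrete, here is a small self-contained userspace illustration (editor's addition; the names and values are simplified stand-ins, not the kernel code) of how passing output pointers in the wrong order silently swaps time_enabled and time_running:

  #include <stdio.h>

  /* Simplified stand-in for the helper: it fills "enabled" first and
   * "running" second. */
  static void calc_timer_values(unsigned long long total_enabled,
                                unsigned long long total_running,
                                unsigned long long *enabled,
                                unsigned long long *running)
  {
          *enabled = total_enabled;
          *running = total_running;
  }

  int main(void)
  {
          unsigned long long enabled, running;

          /* Buggy call site: pointers in the wrong order, so the values
           * are swapped whenever running != enabled (multiplexing). */
          calc_timer_values(40144625315ULL, 20072312657ULL, &running, &enabled);
          printf("buggy: ENA=%llu RUN=%llu\n", enabled, running);

          /* Fixed call site: argument order matches the prototype. */
          calc_timer_values(40144625315ULL, 20072312657ULL, &enabled, &running);
          printf("fixed: ENA=%llu RUN=%llu\n", enabled, running);
          return 0;
  }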
2011-08-31  perf: provide PMU when initing events  (Mark Rutland)
Currently, an event's 'pmu' field is set after pmu::event_init() is called. This means that pmu::event_init() must figure out which struct pmu the event was initialised from. This makes it difficult to consolidate common event initialisation code for similar PMUs, and very difficult to implement drivers for PMUs which can have multiple instances (e.g. a USB controller PMU, a GPU PMU, etc). This patch sets the 'pmu' field before initialising the event, allowing event init code to identify the struct pmu instance easily. In the event of failure to initialise an event, the event is destroyed via kfree() without calling perf_event::destroy(), so this shouldn't result in bad behaviour even if the destroy field was set before failure to initialise was noted. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1313062280-19123-1-git-send-email-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
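The ordering change itself is small; a hedged sketch of what the core probing path looks like with it applied (editor's illustration based on the description above, not the verbatim kernel code):

  /* Sketch of perf_init_event()-style probing: event->pmu is assigned
   * before event_init() runs, so the driver can inspect it; on failure
   * the event is freed with kfree() and ->destroy() is never called. */
  static struct pmu *try_init_event(struct pmu *pmu, struct perf_event *event)
  {
          int ret;

          event->pmu = pmu;               /* set before init */
          ret = pmu->event_init(event);   /* driver may use event->pmu */

          return ret ? ERR_PTR(ret) : pmu;
  }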
2011-08-30  trace: Add ring buffer stats to measure rate of events  (Vaibhav Nagarnaik)
The stats file under the per_cpu folder provides the number of entries, overruns and other statistics about the CPU ring buffer. However, the numbers do not give any indication of how full the ring buffer is in bytes compared to its overall size in bytes. It is also helpful to know the rate at which the cpu buffer is filling up.

This patch adds an entry "bytes: " to the printed stats for the per_cpu ring buffer, which reports the actual bytes consumed in the ring buffer. This field includes the bytes used by recorded events and the padding bytes added when moving the tail pointer to the next page.

It also adds the following time stamps:
  "oldest event ts:" - the oldest timestamp in the ring buffer
  "now ts:"          - the timestamp at the time of reading

The field "now ts" provides a consistent time snapshot to userspace when being read. It is read from the same trace clock used by tracing event timestamps.

Together, these values give the rate at which the buffer is filling up, from the formula:

  bytes / (now_ts - oldest_event_ts)

Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Link: http://lkml.kernel.org/r/1313531179-9323-3-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
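Assuming debugfs is mounted at /sys/kernel/debug and that the per-cpu stats file prints the fields in the "key: value" form described above (both assumptions), a small userspace reader could derive the fill rate like this (editor's sketch):

  #include <stdio.h>

  int main(void)
  {
          /* stats file for CPU 0; one such file exists per CPU */
          const char *path = "/sys/kernel/debug/tracing/per_cpu/cpu0/stats";
          unsigned long long bytes = 0;
          double oldest = 0.0, now = 0.0;
          char line[256];
          FILE *f = fopen(path, "r");

          if (!f) {
                  perror(path);
                  return 1;
          }
          while (fgets(line, sizeof(line), f)) {
                  sscanf(line, "bytes: %llu", &bytes);
                  sscanf(line, "oldest event ts: %lf", &oldest);
                  sscanf(line, "now ts: %lf", &now);
          }
          fclose(f);

          /* bytes / (now_ts - oldest_event_ts), the formula above;
           * assumes the timestamps are reported in seconds. */
          if (now > oldest)
                  printf("fill rate: %.0f bytes/sec\n", bytes / (now - oldest));
          return 0;
  }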
2011-08-30  trace: Add a new readonly entry to report total buffer size  (Vaibhav Nagarnaik)
The current file "buffer_size_kb" reports the size of per-cpu buffer and not the overall memory allocated which could be misleading. A new file "buffer_total_size_kb" adds up all the enabled CPU buffer sizes and reports it. This is only a readonly entry. Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Cc: Michael Rubin <mrubin@google.com> Cc: David Sharp <dhsharp@google.com> Link: http://lkml.kernel.org/r/1313531179-9323-2-git-send-email-vnagarnaik@google.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-08-30  tracing: Add preempt disable for filter self test  (Steven Rostedt)
The self testing for event filters does not really need preemption disabled, as there are no races at the time of testing, but the functions it calls use rcu_dereference_sched(), which will complain if preemption is not disabled.
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
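The shape of the fix, as an editor's sketch (the wrapper name is illustrative; filter_match_preds() and rcu_dereference_sched() are the functions named above):

  /* Sketch: the self test itself is race-free, but filter_match_preds()
   * uses rcu_dereference_sched(), which warns unless preemption is
   * disabled around the call. */
  static int test_one_filter(struct event_filter *filter, void *record)
  {
          int match;

          preempt_disable();
          match = filter_match_preds(filter, record);
          preempt_enable();

          return match;
  }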
2011-08-29  perf events: Fix slow and broken cgroup context switch code  (Stephane Eranian)
The current cgroup context switch code was incorrect, leading to bogus counts. Furthermore, as soon as there was an active cgroup event on a CPU, the context switch cost on that CPU would increase by a significant amount, as demonstrated by a simple ping/pong example:

  $ ./pong
  Both processes pinned to CPU1, running for 10s
  10684.51 ctxsw/s

Now start a cgroup perf stat:

  $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

  $ ./pong
  Both processes pinned to CPU1, running for 10s
  6674.61 ctxsw/s

That's a 37% penalty. Note that pong is not even in the monitored cgroup. The results shown by perf stat are bogus:

  $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100
  Performance counter stats for 'sleep 100':
  CPU1 <not counted> cycles       test
  CPU1 16,984,189,138 cycles      #    0.000 GHz

The second 'cycles' event should report a count @ CPU clock (here 2.4GHz) as it is counting across all cgroups.

The patch below fixes the bogus accounting and bypasses any cgroup switches in case the outgoing and incoming tasks are in the same cgroup.

With this patch the same test now yields:

  $ ./pong
  Both processes pinned to CPU1, running for 10s
  10775.30 ctxsw/s

Start perf stat with cgroup:

  $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

Run pong outside the cgroup:

  $ /pong
  Both processes pinned to CPU1, running for 10s
  10687.80 ctxsw/s

The penalty is now less than 2%. And the results for perf stat are correct:

  $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
  Performance counter stats for 'sleep 10':
  CPU1 <not counted> cycles       test    #    0.000 GHz
  CPU1 23,933,981,448 cycles              #    0.000 GHz

Now perf stat reports the correct counts for the non-cgroup event. If we run pong inside the cgroup, then we also get the correct counts:

  $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
  Performance counter stats for 'sleep 10':
  CPU1 22,297,726,205 cycles      test    #    0.000 GHz
  CPU1 23,933,981,448 cycles              #    0.000 GHz

  10.001457237 seconds time elapsed

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
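The key optimization is the early-out when the outgoing and incoming tasks belong to the same perf cgroup; a simplified sketch (editor's illustration, not the exact kernel code):

  /* Sketch of the context-switch path after the fix: only perform the
   * expensive cgroup event switch when the cgroups actually differ. */
  static void perf_cgroup_switch_sketch(struct task_struct *prev,
                                        struct task_struct *next)
  {
          struct perf_cgroup *cgrp_prev = perf_cgroup_from_task(prev);
          struct perf_cgroup *cgrp_next = perf_cgroup_from_task(next);

          if (cgrp_prev == cgrp_next)
                  return;         /* same cgroup: nothing to switch */

          /* otherwise: schedule out prev's cgroup events,
           * schedule in next's */
  }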
2011-08-29  sched: Fix a memory leak in __sdt_free()  (WANG Cong)
This patch fixes the following memory leak:

  unreferenced object 0xffff880107266800 (size 512):
    comm "sched-powersave", pid 3718, jiffies 4323097853 (age 27495.450s)
    hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace:
      [<ffffffff81133940>] create_object+0x187/0x28b
      [<ffffffff814ac103>] kmemleak_alloc+0x73/0x98
      [<ffffffff811232ba>] __kmalloc_node+0x104/0x159
      [<ffffffff81044b98>] kzalloc_node.clone.97+0x15/0x17
      [<ffffffff8104cb90>] build_sched_domains+0xb7/0x7f3
      [<ffffffff8104d4df>] partition_sched_domains+0x1db/0x24a
      [<ffffffff8109ee4a>] do_rebuild_sched_domains+0x3b/0x47
      [<ffffffff810a00c7>] rebuild_sched_domains+0x10/0x12
      [<ffffffff8104d5ba>] sched_power_savings_store+0x6c/0x7b
      [<ffffffff8104d5df>] sched_mc_power_savings_store+0x16/0x18
      [<ffffffff8131322c>] sysdev_class_store+0x20/0x22
      [<ffffffff81193876>] sysfs_write_file+0x108/0x144
      [<ffffffff81135b10>] vfs_write+0xaf/0x102
      [<ffffffff81135d23>] sys_write+0x4d/0x74
      [<ffffffff814c8a42>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org # 3.0
Link: http://lkml.kernel.org/r/1313671017-4112-1-git-send-email-amwang@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-29  sched: Move blk_schedule_flush_plug() out of __schedule()  (Thomas Gleixner)
There is no real reason to run blk_schedule_flush_plug() with interrupts and preemption disabled. Move it into schedule() and call it when the task is going voluntarily to sleep. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from being woken immediately after switching away. This fixes a deadlock in the scheduler where the blk_schedule_flush_plug() callchain enables interrupts and thereby allows the task that is going to sleep to be woken up.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@kernel.org # 2.6.39+
Link: http://lkml.kernel.org/n/tip-dwfxtra7yg1b5r65m32ywtct@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-29  sched: Separate the scheduler entry for preemption  (Thomas Gleixner)
Block-IO and workqueues call into notifier functions from the scheduler core code with interrupts and preemption disabled. These calls should be made before entering the scheduler core. To simplify this, separate the scheduler core code out into __schedule(). __schedule() is directly called from the places which set PREEMPT_ACTIVE, and from schedule(). This allows us to add the work checks into schedule(), so they are only called when a task voluntarily goes to sleep.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@kernel.org # 2.6.39+
Link: http://lkml.kernel.org/r/20110622174918.813258321@linutronix.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
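A hedged sketch of the resulting structure (editor's illustration based on this and the previous commit message; the helper name sched_submit_work is an assumption about how the pre-scheduling work is factored out):

  /* Voluntary sleep path: do the block-layer work with interrupts and
   * preemption still enabled, then enter the scheduler core. */
  static inline void sched_submit_work(struct task_struct *tsk)
  {
          if (!tsk->state)
                  return;                 /* not going to sleep */
          if (blk_needs_flush_plug(tsk))
                  blk_schedule_flush_plug(tsk);
  }

  asmlinkage void schedule(void)
  {
          struct task_struct *tsk = current;

          sched_submit_work(tsk);
          __schedule();                   /* scheduler core */
  }

  /* Preemption paths (those setting PREEMPT_ACTIVE) call __schedule()
   * directly and skip the work checks entirely. */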
2011-08-26  All Arch: remove linkage for sys_nfsservctl system call  (NeilBrown)
The nfsservctl system call is now gone, so we should remove all linkage for it. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25  kernel/printk: do not turn off bootconsole in printk_late_init() if keep_bootcon  (Nishanth Aravamudan)
It seems that 7bf693951a8e ("console: allow to retain boot console via boot option keep_bootcon") doesn't always achieve what it aims, as when printk_late_init() runs it unconditionally turns off all boot consoles. With this patch, I am able to see more messages on the boot console in KVM guests than I can without, when keep_bootcon is specified. I think it is appropriate for the relevant -stable trees. However, it's more of an annoyance than a serious bug (ideally you don't need to keep the boot console around as console handover should be working -- I was encountering a situation where the console handover wasn't working and not having the boot console available meant I couldn't see why). Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Greg KH <gregkh@suse.de> Acked-by: Fabio M. Di Nitto <fdinitto@redhat.com> Cc: <stable@kernel.org> [2.6.39.x, 3.0.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
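The fix amounts to honoring keep_bootcon during the late-init teardown; a hedged sketch of the check (editor's illustration, simplified from the description above):

  /* Sketch of printk_late_init(): leave boot consoles registered when
   * the user asked for them via the keep_bootcon boot option. */
  static int __init printk_late_init_sketch(void)
  {
          struct console *con;

          for_each_console(con) {
                  if (!keep_bootcon && con->flags & CON_BOOT)
                          unregister_console(con);
          }
          return 0;
  }
  late_initcall(printk_late_init_sketch);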
2011-08-25  Add a personality to report 2.6.x version numbers  (Andi Kleen)
I ran into a couple of programs which broke with the new Linux 3.0 version. Some of those were binary only. I tried to use LD_PRELOAD to work around it, but it was quite difficult and in one case impossible because of a mix of 32bit and 64bit executables. For example, all kinds of management software from HP doesn't work unless we pretend to run a 2.6 kernel.

  $ uname -a
  Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux

  $ hpacucli ctrl all show
  Error: No controllers detected.

  $ rpm -qf /usr/sbin/hpacucli
  hpacucli-8.75-12.0

Another notable case is that Python now reports "linux3" from sys.platform(), which in turn can break things that were checking sys.platform() == "linux2":

  https://bugzilla.mozilla.org/show_bug.cgi?id=664564

It seems pretty clear to me that it's a bug in the apps that are using '==' instead of .startswith(), but this allows us to unbreak broken programs.

This patch adds a UNAME26 personality that makes the kernel report a 2.6.40+x version number instead. The x is the x in 3.x.

I know this is somewhat ugly, but I didn't find a better workaround, and compatibility with existing programs is important.

Some programs also read /proc/sys/kernel/osrelease. This can be worked around in user space with mount --bind (and a mount namespace).

To use:

  wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c
  gcc -o uname26 uname26.c
  ./uname26 program

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
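The referenced uname26.c wrapper boils down to a personality(2) call followed by an exec; a self-contained approximation (editor's sketch, assuming UNAME26 has the value 0x0020000 introduced by this patch):

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/personality.h>

  #ifndef UNAME26
  #define UNAME26 0x0020000       /* assumption: value added by this patch */
  #endif

  int main(int argc, char **argv)
  {
          if (argc < 2) {
                  fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
                  return 1;
          }
          /* Report a 2.6.40+x version to this process and its children,
           * then run the legacy program. */
          if (personality(PER_LINUX | UNAME26) == -1) {
                  perror("personality");
                  return 1;
          }
          execvp(argv[1], &argv[1]);
          perror("execvp");
          return 1;
  }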
2011-08-25  PM QoS: Add global notification mechanism for device constraints  (Jean Pihet)
Add a global notification chain that gets called upon changes to the aggregated constraint value for any device. The notification callbacks pass the full constraint request data so that the callees have access to it. The current use is for the platform low-level code to access the target device of the constraint.
Signed-off-by: Jean Pihet <j-pihet@ti.com>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-08-25  PM QoS: Generalize and export constraints management code  (Jean Pihet)
In preparation for the per-device constraints support:
  - rename update_target to pm_qos_update_target
  - generalize and export pm_qos_update_target for use by the upcoming per-device latency constraints framework:
    * operate on struct pm_qos_constraints for constraints management,
    * introduce an 'action' parameter for constraint add/update/remove,
    * the return value indicates whether the aggregated constraint value has changed,
  - update the internal code to operate on struct pm_qos_constraints
  - add a NULL pointer check in the API functions
Signed-off-by: Jean Pihet <j-pihet@ti.com>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
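A hedged sketch of the exported interface shape described above (editor's illustration; the exact field list and enum member names are assumptions pieced together from this and the neighbouring commit messages):

  /* Aggregated constraints for one QoS class (later: one device). */
  struct pm_qos_constraints {
          struct plist_head list;         /* active requests */
          s32 target_value;               /* current aggregate */
          s32 default_value;
          int type;                       /* min- or max-style aggregation */
          struct blocking_notifier_head *notifiers;
  };

  enum pm_qos_req_action {
          PM_QOS_ADD_REQ,
          PM_QOS_UPDATE_REQ,
          PM_QOS_REMOVE_REQ,
  };

  /* Returns 1 if the aggregated target value changed, 0 otherwise. */
  int pm_qos_update_target(struct pm_qos_constraints *c,
                           struct plist_node *node,
                           enum pm_qos_req_action action, int value);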
2011-08-25  PM QoS: Reorganize data structs  (Jean Pihet)
In preparation for the per-device constraints support, reorganize the data structures:
  - add a struct pm_qos_constraints which contains the constraints-related data
  - update struct pm_qos_object contents to the PM QoS internal object data, and add a pointer to struct pm_qos_constraints
  - update the internal code to use the new data structs.
Signed-off-by: Jean Pihet <j-pihet@ti.com>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-08-25  PM QoS: Code reorganization  (Jean Pihet)
Move around the PM QoS misc devices management code for better readability. Signed-off-by: Jean Pihet <j-pihet@ti.com> Acked-by: markgross <markgross@thegnar.org> Reviewed-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-08-25  PM QoS: Minor clean-ups  (Jean Pihet)
Misc fixes to improve code readability:
  * rename struct pm_qos_request_list to struct pm_qos_request,
  * rename the pm_qos_req parameter to req in internal code and consistently use req in the API parameters,
  * update the in-kernel API callers to the new parameter names,
  * rename fields (requests, list, node, constraints).
Signed-off-by: Jean Pihet <j-pihet@ti.com>
Acked-by: markgross <markgross@thegnar.org>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-08-25  PM QoS: Move and rename the implementation files  (Jean Pihet)
The PM QoS implementation files are better named kernel/power/qos.c and include/linux/pm_qos.h. The PM QoS support is compiled under the CONFIG_PM option. Signed-off-by: Jean Pihet <j-pihet@ti.com> Acked-by: markgross <markgross@thegnar.org> Reviewed-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-08-23  Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs  (Linus Torvalds)
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: fix tracing builds inside the source tree
  xfs: remove subdirectories
  xfs: don't expect xfs headers to be in subdirectories
2011-08-23  Revert "irq: Always set IRQF_ONESHOT if no primary handler is specified"  (Linus Torvalds)
This reverts commit f3637a5f2e2eb391ff5757bc83fb5de8f9726464.

It turns out that this breaks several drivers, one example being OMAP boards which use the on-board OMAP UARTs and the omap-serial driver, which will not boot to userspace after the commit. Paul Walmsley reports that enabling CONFIG_DEBUG_SHIRQ reveals 'IRQ handler type mismatch' errors:

  IRQ handler type mismatch for IRQ 74
  current handler: serial idle
  ...

and the reason is that setting IRQF_ONESHOT will now result in those interrupt handlers having different IRQF flags, and thus being unsharable. So the commit log in the reverted commit:

  "Since it is required for those users and there is no difference for
   others it makes sense to add this flag unconditionally."

is simply not true: there may not be any difference from an "actions at irq time" perspective, but there is a *big* difference with respect to how irq management tests this flag (see __setup_irq() in kernel/irq/manage.c).

One solution may be to stop verifying IRQF_ONESHOT in __setup_irq(), but right now the safe course of action is to revert the change. Let's revisit this in a later merge window.

Reported-by: Paul Walmsley <paul@pwsan.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Requested-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
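The sharing failure is the generic one: an interrupt line can only be shared by handlers whose IRQF flags agree. An editor's illustration (the IRQ number matches the report above; serial_isr, other_irq_thread and the dev_* cookies are hypothetical):

  static int share_irq74_sketch(void *dev_a, void *dev_b)
  {
          int ret;

          /* First driver: a normal shared primary handler on IRQ 74. */
          ret = request_irq(74, serial_isr, IRQF_SHARED, "serial idle", dev_a);
          if (ret)
                  return ret;

          /* Second driver: threaded-only handler on the same line. Under
           * the reverted commit it implicitly gained IRQF_ONESHOT, so
           * __setup_irq() saw mismatched flags and refused to share:
           * "IRQ handler type mismatch". */
          return request_threaded_irq(74, NULL, other_irq_thread,
                                      IRQF_SHARED, "other-dev", dev_b);
  }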
2011-08-23  CRED: fix build error due to 'tgcred' undeclared  (Axel Lin)
This patch adds a CONFIG_KEYS guard for tgcred to fix the build error below when CONFIG_KEYS is not configured.

  CC      kernel/cred.o
  kernel/cred.c: In function 'prepare_kernel_cred':
  kernel/cred.c:657: error: 'tgcred' undeclared (first use in this function)
  kernel/cred.c:657: error: (Each undeclared identifier is reported only once
  kernel/cred.c:657: error: for each function it appears in.)
  make[1]: *** [kernel/cred.o] Error 1
  make: *** [kernel] Error 2

Signed-off-by: Axel Lin <axel.lin@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
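The shape of such a guard, as a hedged sketch (editor's illustration; the surrounding prepare_kernel_cred() logic is abbreviated to the tgcred handling):

  /* struct thread_group_cred only exists when CONFIG_KEYS is enabled,
   * so every reference to tgcred gets wrapped the same way. */
  static int alloc_tgcred_sketch(struct cred *new)
  {
  #ifdef CONFIG_KEYS
          struct thread_group_cred *tgcred;

          tgcred = kzalloc(sizeof(*tgcred), GFP_KERNEL);
          if (!tgcred)
                  return -ENOMEM;
          new->tgcred = tgcred;
  #endif
          return 0;
  }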
2011-08-23  CRED: Fix prepare_kernel_cred() to provide a new thread_group_cred struct  (David Howells)
Fix prepare_kernel_cred() to provide a new, separate thread_group_cred struct; otherwise, when using request_key(), ____call_usermodehelper() calls umh_keys_init() with the new creds pointing to init_tgcred, which umh_keys_init() then blithely alters.

The problem can be demonstrated by:

  # keyctl request2 user a debug:a @s
  249681132
  # grep req /proc/keys
  079906a5 I--Q--  1 perm 1f3f0000  0  0 keyring  _req.249681132: 1/4
  38ef1626 IR----  1 expd 0b010000  0  0 .request_ key:ee1d4ec pid:4371 ci:1

The keyring _req.XXXX should have gone away, but something (init_tgcred) is pinning it.

The key actually requested can then be removed and a new one created:

  # keyctl unlink 249681132
  1 links removed
  [root@andromeda ~]# grep req /proc/keys
  116cecac IR----  1 expd 0b010000  0  0 .request_ key:eeb4911 pid:4379 ci:1
  36d1cbf8 I--Q--  1 perm 1f3f0000  0  0 keyring  _req.250300689: 1/4

which causes the old _req keyring to go away and a new one to take its place.

This is a consequence of the changes in:

  commit 879669961b11e7f40b518784863a259f735a72bf
  Author: David Howells <dhowells@redhat.com>
  Date:   Fri Jun 17 11:25:59 2011 +0100

      KEYS/DNS: Fix ____call_usermodehelper() to not lose the session keyring

and:

  commit 17f60a7da150fdd0cfb9756f86a262daa72c835f
  Author: Eric Paris <eparis@redhat.com>
  Date:   Fri Apr 1 17:07:50 2011 -0400

      capabilites: allow the application of capability limits to usermode helpers

After this patch is applied, the _req keyring and the .request_key key are cleaned up.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2011-08-19  tracing/filter: Add startup tests for events filter  (Jiri Olsa)
Adding automated tests running as a late_initcall. Tests are compiled in with the CONFIG_FTRACE_STARTUP_TEST option. A test event "ftrace_test_filter" is added and used to simulate filter processing during event occurrence. String filters are compiled and tested against several test events with different values. It is also tested that evaluation of explicit predicates is omitted due to the lazy filter evaluation.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-11-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-08-19  tracing/filter: Change filter_match_preds function to use walk_pred_tree  (Jiri Olsa)
Changing filter_match_preds function to use unified predicates tree processing. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/1313072754-4620-10-git-send-email-jolsa@redhat.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-08-19  tracing/filter: Change fold_pred function to use walk_pred_tree  (Jiri Olsa)
Changing fold_pred function to use unified predicates tree processing. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/1313072754-4620-9-git-send-email-jolsa@redhat.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-08-19  tracing/filter: Change fold_pred_tree function to use walk_pred_tree  (Jiri Olsa)
Changing fold_pred_tree function to use unified predicates tree processing. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/1313072754-4620-8-git-send-email-jolsa@redhat.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-08-19  tracing/filter: Change count_leafs function to use walk_pred_tree  (Jiri Olsa)
Changing count_leafs function to use unified predicates tree processing. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/1313072754-4620-7-git-send-email-jolsa@redhat.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-08-19  tracing/filter: Unify predicate tree walking, change check_pred_tree function to use it  (Jiri Olsa)
Adding a walk_pred_tree function to be used for walking through the filter predicates. For each predicate the callback function is called, allowing users to add their own functionality or customize their way through the filter predicates. Changing the check_pred_tree function to use walk_pred_tree.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-6-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
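A hedged sketch of how such a walker is used (editor's illustration; only walk_pred_tree and the converted functions are named by these commits, the callback signature and constant names are assumptions):

  /* Example callback in the style of the count_leafs conversion listed
   * above: count leaf predicates while walk_pred_tree() does the
   * traversal. */
  static int count_leafs_cb(enum move_type move, struct filter_pred *pred,
                            int *err, void *data)
  {
          int *count = data;

          if (move == MOVE_DOWN && pred->left == FILTER_PRED_INVALID)
                  (*count)++;             /* no children: a leaf */

          return WALK_PRED_DEFAULT;       /* keep walking */
  }

  /* call site:
   *      walk_pred_tree(filter->preds, root, count_leafs_cb, &count);
   */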
2011-08-19  tracing/filter: Simplify tracepoint event lookup  (Jiri Olsa)
We don't need to perform a lookup through the ftrace_events list; instead we can use the 'tp_event' field. Each perf_event contains the tracepoint event field 'tp_event', which is initialized during tracepoint event initialization.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-5-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-08-19  tracing/filter: Remove field_name from filter_pred struct  (Jiri Olsa)
The field_name was used only for finding the event's fields. With it removed, we no longer need to care about field_name allocation/freeing.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-4-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-08-19  tracing/filter: Separate predicate init and filter addition  (Jiri Olsa)
Making the code cleaner by having one function to fully prepare the predicate (create_pred), and another to add the predicate to the filter (filter_add_pred). As a benefit, this way the dry_run flag stays only inside the replace_preds function and is not passed deeper. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/1313072754-4620-3-git-send-email-jolsa@redhat.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-08-19  tracing/filter: Use static allocation for filter predicates  (Jiri Olsa)
Don't dynamically allocate filter_pred struct, use static memory. This way we can get rid of the code managing the dynamic filter_pred struct object. The create_pred function integrates create_logical_pred function. This way the static predicate memory is returned only from one place. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/1313072754-4620-2-git-send-email-jolsa@redhat.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-08-19  Merge branch 'for-linus' of git://git.kernel.dk/linux-block  (Linus Torvalds)
* 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
  Revert "cfq: Remove special treatment for metadata rqs."
  block: fix flush machinery for stacking drivers with differring flush flags
  block: improve rq_affinity placement
  blktrace: add FLUSH/FUA support
  Move some REQ flags to the common bio/request area
  allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
  xen/blkback: Make description more obvious.
  cfq-iosched: Add documentation about idling
  block: Make rq_affinity = 1 work as expected
  block: swim3: fix unterminated of_device_id table
  block/genhd.c: remove useless cast in diskstats_show()
  drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
  drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
  bsg-lib: add module.h include
  cfq-iosched: Reduce linked group count upon group destruction
  blk-throttle: correctly determine sync bio
  loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
  loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
  loop: add management interface for on-demand device allocation
  loop: replace linked list of allocated devices with an idr index
  ...
2011-08-18  irqdesc: fix new kernel-doc warning  (Randy Dunlap)
Fix kernel-doc warning in irqdesc.c:

  Warning(kernel/irq/irqdesc.c:353): No description found for parameter 'owner'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-18  Merge branch 'perf/urgent' into perf/core  (Ingo Molnar)
Merge reason: add the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-17  Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm  (Linus Torvalds)
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / Domains: Fix build for CONFIG_PM_RUNTIME unset
2011-08-17  Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip  (Linus Torvalds)
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  lockdep: Fix wrong assumption in match_held_lock
2011-08-17  Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip  (Linus Torvalds)
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  irq: Track the owner of irq descriptor
  irq: Always set IRQF_ONESHOT if no primary handler is specified
  genirq: Fix wrong bit operation
2011-08-14  PM / Domains: Fix build for CONFIG_PM_RUNTIME unset  (Rafael J. Wysocki)
Function genpd_queue_power_off_work() is not defined for CONFIG_PM_RUNTIME, so pm_genpd_poweroff_unused() causes a build error to happen in that case. Fix the problem by making pm_genpd_poweroff_unused() depend on CONFIG_PM_RUNTIME too. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-08-14  sched: Return unused runtime on group dequeue  (Paul Turner)
When a local cfs_rq blocks we return the majority of its remaining quota to the global bandwidth pool for use by other runqueues. We do this only when the quota is current and there is more than min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.

In the case where there are throttled runqueues and we have sufficient bandwidth to meter out a slice, a second timer is kicked off to handle this delivery, unthrottling where appropriate.

Using a 'worst case' antagonist which executes on each cpu for 1ms before moving onto the next on a fairly large machine:

no quota generations:
  197.47 ms  /cgroup/a/cpuacct.usage
  199.46 ms  /cgroup/a/cpuacct.usage
  205.46 ms  /cgroup/a/cpuacct.usage
  198.46 ms  /cgroup/a/cpuacct.usage
  208.39 ms  /cgroup/a/cpuacct.usage
Since we are allowed to use "stale" quota our usage is effectively bounded by the rate of input into the global pool and performance is relatively stable.

with quota generations [1s increments]:
  119.58 ms  /cgroup/a/cpuacct.usage
  119.65 ms  /cgroup/a/cpuacct.usage
  119.64 ms  /cgroup/a/cpuacct.usage
  119.63 ms  /cgroup/a/cpuacct.usage
  119.60 ms  /cgroup/a/cpuacct.usage
The large deficit here is due to quota generations (/intentionally/) preventing us from now using previously stranded slack quota. The cost is that this quota becomes unavailable.

with quota generations and quota return:
  200.09 ms  /cgroup/a/cpuacct.usage
  200.09 ms  /cgroup/a/cpuacct.usage
  198.09 ms  /cgroup/a/cpuacct.usage
  200.09 ms  /cgroup/a/cpuacct.usage
  200.06 ms  /cgroup/a/cpuacct.usage
By returning unused quota we're able to both stably consume our desired quota and prevent unintentional overages due to the abuse of slack quota from previous quota periods (especially on a large machine).

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.306848658@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Add exports tracking cfs bandwidth control statistics  (Nikhil Rao)
This change introduces statistics exports for the cpu sub-system; these are added through the use of a stat file similar to those exported by other subsystems.

The following exports are included:
  nr_periods:     number of periods in which execution occurred
  nr_throttled:   the number of periods above in which execution was throttled
  throttled_time: cumulative wall-time that any cpus have been throttled for this group

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.198901931@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
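Assuming the cpu cgroup controller is mounted at /sys/fs/cgroup/cpu (a mount-point assumption) and the group is named "a", the new per-group statistics can be read like any other stat file; a small editor's sketch:

  #include <stdio.h>

  int main(void)
  {
          const char *path = "/sys/fs/cgroup/cpu/a/cpu.stat";
          char name[64];
          unsigned long long value;
          FILE *f = fopen(path, "r");

          if (!f) {
                  perror(path);
                  return 1;
          }
          /* expected keys: nr_periods, nr_throttled, throttled_time */
          while (fscanf(f, "%63s %llu", name, &value) == 2)
                  printf("%-16s %llu\n", name, value);
          fclose(f);
          return 0;
  }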
2011-08-14  sched: Throttle entities exceeding their allowed bandwidth  (Paul Turner)
With the machinery in place to throttle and unthrottle entities, as well as to handle their participation (or lack thereof), we can now enable throttling.

There are two points at which we must check whether it's time to set the throttled state: put_prev_entity() and enqueue_entity().
  - put_prev_entity() is the typical throttle path; we reach it by exceeding our allocated run-time within update_curr()->account_cfs_rq_runtime() and going through a reschedule.
  - enqueue_entity() covers the case of a wake-up into an already throttled group. In this case we know the group cannot be on_rq and can throttle immediately.

Checks are added at the time of put_prev_entity() and enqueue_entity().

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.091415417@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
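A heavily hedged sketch of the two check sites (editor's illustration; only the two call sites are stated by the commit, the helper and field names are assumptions about the CFS bandwidth machinery):

  /* 1) put_prev_entity(): the typical path, reached after
   *    account_cfs_rq_runtime() drove the remaining quota negative. */
  static void check_throttle_on_put(struct cfs_rq *cfs_rq)
  {
          if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
                  throttle_cfs_rq(cfs_rq);
  }

  /* 2) enqueue_entity(): a wake-up into a group that is already out of
   *    quota; the group cannot be on_rq, so throttle it immediately. */
  static void check_throttle_on_enqueue(struct cfs_rq *cfs_rq)
  {
          if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
                  throttle_cfs_rq(cfs_rq);
  }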
2011-08-14  sched: Migrate throttled tasks on HOTPLUG  (Paul Turner)
Throttled tasks are invisible to cpu-offline since they are not eligible for selection by pick_next_task(). The regular 'escape' path for a thread that is blocked at offline is via ttwu->select_task_rq; however, this will not handle a throttled group since there are no individual thread wakeups on an unthrottle. Resolve this by unthrottling offline cpus so that threads can be migrated.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.989000590@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Prevent buddy interactions with throttled entities  (Paul Turner)
Buddies allow us to select "on-rq" entities without actually selecting them from a cfs_rq's rb_tree. As a result we must ensure that throttled entities are not falsely nominated as buddies. The fact that entities are dequeued within throttle_entity is not sufficient for clearing buddy status as the nomination may occur after throttling. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.886850167@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Prevent interactions with throttled entities  (Paul Turner)
From the perspective of load-balance and shares distribution, throttled entities should be invisible. However, both of these operations work on 'active' lists and are not inherently aware of what group hierarchies may be present. In some cases this may be side-stepped (e.g. we could sideload via tg_load_down in load balance) while in others (e.g. update_shares()) it is more difficult to compute without incurring some O(n^2) costs.

Instead, track hierarchical throttled state at time of transition. This allows us to easily identify whether an entity belongs to a throttled hierarchy and avoid incorrect interactions with it.

Also, when an entity leaves a throttled hierarchy we need to advance its time averaging for shares averaging so that the elapsed throttled time is not considered as part of the cfs_rq's operation.

We also use this information to prevent buddy interactions in the wakeup and yield_to() paths.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.777916795@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14  sched: Allow for positional tg_tree walks  (Paul Turner)
Extend walk_tg_tree to accept a positional argument:

  static int walk_tg_tree_from(struct task_group *from,
                               tg_visitor down, tg_visitor up, void *data)

Existing semantics are preserved; the caller must hold rcu_lock() or a sufficient analogue.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.677889157@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
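A usage sketch of the positional walk (editor's illustration; walk_tg_tree_from() and the tg_visitor type come from the commit, the visitor callbacks here are hypothetical):

  /* Hypothetical visitors: called for each group on the way down/up;
   * returning 0 keeps the walk going. */
  static int visit_group_down(struct task_group *tg, void *data)
  {
          /* inspect or update per-group state here */
          return 0;
  }

  static int visit_group_up(struct task_group *tg, void *data)
  {
          return 0;
  }

  /* Walk only the subtree rooted at 'from' rather than the whole
   * task-group tree; the caller holds rcu_read_lock() (or equivalent). */
  static void walk_subtree_sketch(struct task_group *from, void *data)
  {
          rcu_read_lock();
          walk_tg_tree_from(from, visit_group_down, visit_group_up, data);
          rcu_read_unlock();
  }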
2011-08-14  sched: Add support for unthrottling group entities  (Paul Turner)
At the start of each period we refresh the global bandwidth pool. At this time we must also unthrottle any cfs_rq entities who are now within bandwidth once more (as quota permits). Unthrottled entities have their corresponding cfs_rq->throttled flag cleared and their entities re-enqueued. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.574628950@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>