git.armlinux.org.uk/linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2010-10-19	Fixing ethernet driver compilation error for i.MX31 ADS board	Ian Lartey
	This is only a partial revert of "ARM: mx3/mx31ads: fold board header in its only user" [commit ccfa7c269843001077df02d98918c6c9bde91395)] As some of the the board defines are also used in the cs89x0 ethernet driver by the i.MX31 ADS. Signed-off-by: Ian Lartey <ian@opensource.wolfsonmicro.com> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
2010-10-19	cpuimx51: update board support	Eric Bénard
	add NAND, SDHC Signed-off-by: Eric Bénard <eric@eukrea.com>
2010-10-19	mx5: add cpuimx51sd module and its baseboard	Eric Bénard
	Signed-off-by: Eric Bénard <eric@eukrea.com>
2010-10-19	iomux-mx51: fix GPIO_1_xx 's IOMUX configuration	Eric Bénard
	this patch really configure the GPIO in GPIO mode. Signed-off-by: Eric Bénard <eric@eukrea.com>
2010-10-19	imx-esdhc: update devices registration	Eric Bénard
	Tested on i.MX25 and i.MX35 and i.MX51 Signed-off-by: Eric Bénard <eric@eukrea.com>
2010-10-19	mx51: add resources for SD/MMC on i.MX51	Eric Bénard
	the attached patch allows SD to work on i.MX51 with Wolfram's drivers Tested on i.MX51. Based on original patch from: Richard Zhu <r65037@freescale.com> Signed-off-by: Eric Bénard <eric@eukrea.com>
2010-10-19	iomux-mx51: fix SD1 and SD2's iomux configuration	Eric Bénard
	Based on original patch from: Richard Zhu <r65037@freescale.com> Signed-off-by: Eric Bénard <eric@eukrea.com>
2010-10-19	clock-mx51: rename CLOCK1 to CLOCK_CCGR for better readability	Eric Bénard
	Signed-off-by: Eric Bénard <eric@eukrea.com>
2010-10-19	clock-mx51: factorize clk_set_parent and clk_get_rate	Eric Bénard
	Signed-off-by: Eric Bénard <eric@eukrea.com>
2010-10-19	eukrea_mbimxsd: add support for DVI displays	Eric Bénard
	Signed-off-by: Eric Bénard <eric@eukrea.com>
2010-10-19	cpuimx25 & cpuimx35: fix OTG port registration in host mode	Eric Bénard
	the PHY is UTMI so don't create an ULPI viewpoint. Signed-off-by: Eric Bénard <eric@eukrea.com>
2010-10-19	i.MX31 and i.MX35 : fix errate TLSbo65953 and ENGcm09472	Eric Bénard
	Without this exiting WFI can result in cache corruption. Code taken from Freescale's 2.6.27 BSP and tested on i.MX35 Signed-off-by: Eric Bénard <eric@eukrea.com>
2010-10-19	mx25: fix compile error in platform-imx-dma.c	Eric Bénard
	this patch fix the following errors : arch/arm/plat-mxc/devices/platform-imx-dma.c:44: error: ‘MX25_SDMA_BASE_ADDR’ undeclared here (not in a function) arch/arm/plat-mxc/devices/platform-imx-dma.c:44: error: ‘MX25_INT_SDMA’ undeclared here (not in a function) Signed-off-by: Eric Bénard <eric@eukrea.com> Acked-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
2010-10-19	mx25: fix clock's calculation	Eric Bénard
	* get_rate_arm : when 400MHz clock is selected (cctl & 1<<14), ARM clock is 400MHz (MPLL * 3 / 4) and not 800MHz * get_rate_per : peripherals's clock is derived from AHB and not from IPG (ref manual : figure 5-1) * can2_clk : use the correct ID * without this patch, peripherals getting their clock from PER clocks work fine because of the 2 errors which fix themselves (ARM clock x 2 and per clock actually based on IPG which is AHB/2) but flexcan can't work as it gets its clock from IPG and thus calculates its bitrate using a reference value which is twice what it really is. Signed-off-by: Eric Bénard <eric@eukrea.com>
2010-10-19	ARM: imx: add lost 3rd imx-i2c device for mx35	Marc Kleine-Budde
	During the reorganisation of the imx-i2c devices (in 64de5ec168d9743903e6ec482c3e9f37af49f9c1) the 3rd imx-i2c device for the mx35 got lost. This patch adds the missing device. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Acked-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
2010-10-19	ARM: imx: Add iram allocator functions	Dinh Nguyen
	Add IRAM(Internal RAM) allocation functions using GENERIC_ALLOCATOR. The allocation size is 4KB multiples to guarantee alignment. The idea for these functions is for i.MX platforms to use them to dynamically allocate IRAM usage. Applies on 2.6.36-rc7 Signed-off-by: Dinh Nguyen <Dinh.Nguyen@freescale.com> Reviewed-by: Amit Kucheria <amit.kucheria@canonical.com> Acked-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
2010-10-19	KVM: Fix fs/gs reload oops with invalid ldt	Avi Kivity
	kvm reloads the host's fs and gs blindly, however the underlying segment descriptors may be invalid due to the user modifying the ldt after loading them. Fix by using the safe accessors (loadsegment() and load_gs_index()) instead of home grown unsafe versions. This is CVE-2010-3698. KVM-Stable-Tag. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-19	tracing: Fix compile issue for trace_sched_wakeup.c	Steven Rostedt
	The function start_func_tracer() was incorrectly added in the #ifdef CONFIG_FUNCTION_TRACER condition, but is still used even when function tracing is not enabled. The calls to register_ftrace_function() and register_ftrace_graph() become nops (and their arguments are even ignored), thus there is no reason to hide start_func_tracer() when function tracing is not enabled. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-10-19	[S390] hardirq: remove pointless header file includes	Heiko Carstens
	Remove a couple of pointless header file includes. Fixes a compile bug caused by header file include dependencies with "irq: Add tracepoint to softirq_raise" within linux-next. Reported-by: Sachin Sant <sachinp@in.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [ cherry-picked from the s390 tree to fix "2bf2160: irq: Add tracepoint to softirq_raise" ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-19	[IA64] Move local_softirq_pending() definition	Tony Luck
	Ugly #include dependencies. We need to have local_softirq_pending() defined before it gets used in <linux/interrupt.h>. But <asm/hardirq.h> provides the definition after this #include chain: <linux/irq.h> <asm/irq.h> <asm/hw_irq.h> <linux/interrupt.h> Signed-off-by: Tony Luck <tony.luck@intel.com> [ cherry-picked from the ia64 tree to fix "2bf2160: irq: Add tracepoint to softirq_raise" ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-19	futex: Fix errors in nested key ref-counting	Darren Hart
	futex_wait() is leaking key references due to futex_wait_setup() acquiring an additional reference via the queue_lock() routine. The nested key ref-counting has been masking bugs and complicating code analysis. queue_lock() is only called with a previously ref-counted key, so remove the additional ref-counting from the queue_(un)lock() functions. Also futex_wait_requeue_pi() drops one key reference too many in unqueue_me_pi(). Remove the key reference handling from unqueue_me_pi(). This was paired with a queue_lock() in futex_lock_pi(), so the count remains unchanged. Document remaining nested key ref-counting sites. Signed-off-by: Darren Hart <dvhart@linux.intel.com> Reported-and-tested-by: Matthieu Fertré<matthieu.fertre@kerlabs.com> Reported-by: Louis Rilling<louis.rilling@kerlabs.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Kacur <jkacur@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <4CBB17A8.70401@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org
2010-10-19	x86: ioapic: Call free_irte only if interrupt remapping enabled	Yinghai Lu
	On a system that support intr-rempping when booting with "intremap=off" [ 177.895501] BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8 [ 177.913316] IP: [<ffffffff8145fc18>] free_irte+0x47/0xc0 ... [ 178.173326] Call Trace: [ 178.173574] [<ffffffff810515b4>] destroy_irq+0x3a/0x75 [ 178.192934] [<ffffffff81051834>] arch_teardown_msi_irq+0xe/0x10 [ 178.193418] [<ffffffff81458dc3>] arch_teardown_msi_irqs+0x56/0x7f [ 178.213021] [<ffffffff81458e79>] free_msi_irqs+0x8d/0xeb Call free_irte only when interrupt remapping is enabled. Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4CBCB274.7010108@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-10-19	perf, powerpc: Fix power_pmu_event_init to not use event->ctx	Paul Mackerras
	Commit c3f00c70 ("perf: Separate find_get_context() from event initialization") changed the generic perf_event code to call perf_event_alloc, which calls the arch-specific event_init code, before looking up the context for the new event. Unfortunately, power_pmu_event_init uses event->ctx->task to see whether the new event is a per-task event or a system-wide event, and thus crashes since event->ctx is NULL at the point where power_pmu_event_init gets called. (The reason it needs to know whether it is a per-task event is because there are some hardware events on Power systems which only count when the processor is not idle, and there are some fixed-function counters which count such events. For example, the "run cycles" event counts cycles when the processor is not idle. If the user asks to count cycles, we can use "run cycles" if this is a per-task event, since the processor is running when the task is running, by definition. We can't use "run cycles" if the user asks for "cycles" on a system-wide counter.) Fortunately the information we need is in the event->attach_state field, so we just use that instead. Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20101019055535.GA10398@drongo> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reported-by: Alexey Kardashevskiy <aik@au1.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-19	Merge branch 'tip/perf/recordmcount-2' of ↵	Ingo Molnar
	git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core
2010-10-19	ARM: S5PV310: Fix build error on GPIO map	Kukjin Kim
	This patch fixes build error about GPIO address due to conflict of commit 4d914705 and 19a2c065. - commit 4d914705: Fix on GPIO base addresses - commit 19a2c065: Moves initial map for merging S5P64X0 Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
2010-10-18	Merge branch 'hotplug' into devel	Russell King
	Conflicts: arch/arm/kernel/head-common.S
2010-10-18	Merge branches 'at91', 'dcache', 'ftrace', 'hwbpt', 'misc', 'mmci', 's3c', ↵	Russell King
	'st-ux' and 'unwind' into devel
2010-10-18	ftrace: Remove recursion between recordmcount and scripts/mod/empty	Steven Rostedt
	When DYNAMIC_FTRACE is enabled and we use the C version of recordmcount, all objects are run through the recordmcount program to create a separate section that stores all the callers of mcount. The build process has a special file: scripts/mod/empty.o. This is built from empty.c which is literally an empty file (except for a single comment). This file is used to find information about the target elf format, like endianness and word size. The problem comes up when we need to build recordmcount. The build process requires that empty.o is built first. The build rules for empty.o will try to execute recordmcount on the empty.o file. We get an error that recordmcount does not exist. To avoid this recursion, the build file will skip running recordmcount if the file that it is building is script/mod/empty.o. [ extra comment Suggested-by: Sam Ravnborg <sam@ravnborg.org> ] Reported-by: Ingo Molnar <mingo@elte.hu> Tested-by: Ingo Molnar <mingo@elte.hu> Cc: Michal Marek <mmarek@suse.cz> Cc: linux-kbuild@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-10-18	ARM: 6441/1: ux500: The platform is not just based on early drop silicon ↵	Srinidhi Kasagar
	version. Update Kconfig text accordingly. Signed-off-by: srinidhi kasagar <srinidhi.kasagar@stericsson.com> Acked-by: Linus Walleij <linus.walleij@stericsson.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-10-18	Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus	Linus Torvalds
	* 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus: MIPS: Enable ISA_DMA_API config to fix build failure MIPS: 32-bit: Fix build failure in asm/fcntl.h MIPS: Remove all generated vmlinuz* files on "make clean" MIPS: do_sigaltstack() expects userland pointers MIPS: Fix error values in case of bad_stack MIPS: Sanitize restart logics MIPS: secure_computing, syscall audit: syscall number should in r2, not r0. MIPS: Don't block signals if we'd failed to setup a sigframe
2010-10-18	Merge branch 'for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: evdev - fix EVIOCSABS regression Input: evdev - fix Ooops in EVIOCGABS/EVIOCSABS
2010-10-18	Merge branch 'for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6: firewire: ohci: fix TI TSB82AA2 regression since 2.6.35
2010-10-18	mxc_nand: do not depend on disabling the irq in the interrupt handler	Sascha Hauer
	This patch reverts the driver to enabling/disabling the NFC interrupt mask rather than enabling/disabling the system interrupt. This cleans up the driver so that it doesn't rely on interrupts being disabled within the interrupt handler. For i.MX21 we keep the current behaviour, that is calling enable_irq/disable_irq_nosync to enable/disable interrupts. This patch is based on earlier work by John Ogness. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Acked-by: John Ogness <john.ogness@linutronix.de> Tested-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-18	Merge branch 'for-linus/i2c/2636-rc8' of git://git.fluff.org/bjdooks/linux	Linus Torvalds
	* 'for-linus/i2c/2636-rc8' of git://git.fluff.org/bjdooks/linux: i2c-imx: do not allow interruptions when waiting for I2C to complete i2c-davinci: Fix TX setup for more SoCs
2010-10-18	Merge branch 'fixes'	Linus Torvalds
	* fixes: v4l1: fix 32-bit compat microcode loading translation De-pessimize rds_page_copy_user
2010-10-18	sched: Export account_system_vtime()	Ingo Molnar
	KVM uses it for example: ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined! Cc: Venkatesh Pallipadi <venki@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18	sched: Call tick_check_idle before __irq_enter	Venkatesh Pallipadi
	When CPU is idle and on first interrupt, irq_enter calls tick_check_idle() to notify interruption from idle. But, there is a problem if this call is done after __irq_enter, as all routines in __irq_enter may find stale time due to yet to be done tick_check_idle. Specifically, trace calls in __irq_enter when they use global clock and also account_system_vtime change in this patch as it wants to use sched_clock_cpu() to do proper irq timing. But, tick_check_idle was moved after __irq_enter intentionally to prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a: irq: call __irq_enter() before calling the tick_idle_check Impact: avoid spurious ksoftirqd wakeups Moving tick_check_idle() before __irq_enter and wrapping it with local_bh_enable/disable would solve both the problems. Fixed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18	sched: Remove irq time from available CPU power	Venkatesh Pallipadi
	The idea was suggested by Peter Zijlstra here: http://marc.info/?l=linux-kernel&m=127476934517534&w=2 irq time is technically not available to the tasks running on the CPU. This patch removes irq time from CPU power piggybacking on sched_rt_avg_update(). Tested this by keeping CPU X busy with a network intensive task having 75% oa a single CPU irq processing (hard+soft) on a 4-way system. And start seven cycle soakers on the system. Without this change, there will be two tasks on each CPU. With this change, there is a single task on irq busy CPU X and remaining 7 tasks are spread around among other 3 CPUs. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18	sched: Do not account irq time to current task	Venkatesh Pallipadi
	Scheduler accounts both softirq and interrupt processing times to the currently running task. This means, if the interrupt processing was for some other task in the system, then the current task ends up being penalized as it gets shorter runtime than otherwise. Change sched task accounting to acoount only actual task time from currently running task. Now update_curr(), modifies the delta_exec to depend on rq->clock_task. Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats for later. This change will impact scheduling behavior in interrupt heavy conditions. Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2 spending 75%+ of its time in irq processing. CPU 3 spending around 35% time running nc task. Now, if I run another CPU intensive task on CPU 2, without this change /proc/<pid>/schedstat shows 100% of time accounted to this task. With this change, it rightly shows less than 25% accounted to this task as remaining time is actually spent on irq processing. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18	x86: Add IRQ_TIME_ACCOUNTING	Venkatesh Pallipadi
	This patch adds IRQ_TIME_ACCOUNTING option on x86 and runtime enables it when TSC is enabled. This change just enables fine grained irq time accounting, isn't used yet. Following patches use it for different purposes. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-6-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18	sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time	Venkatesh Pallipadi
	s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does the fine granularity accounting of user, system, hardirq, softirq times. Adding that option on archs like x86 will be challenging however, given the state of TSC reliability on various platforms and also the overhead it will add in syscall entry exit. Instead, add a lighter variant that only does finer accounting of hardirq and softirq times, providing precise irq times (instead of timer tick based samples). This accounting is added with a new config option CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not interested in paying the perf penalty. This accounting is based on sched_clock, with the code being generic. So, other archs may find it useful as well. This patch just adds the core logic and does not enable this logic yet. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18	sched: Add a PF flag for ksoftirqd identification	Venkatesh Pallipadi
	To account softirq time cleanly in scheduler, we need to identify whether softirq is invoked in ksoftirqd context or softirq at hardirq tail context. Add PF_KSOFTIRQD for that purpose. As all PF flag bits are currently taken, create space by moving one of the infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along with some other state fields. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18	sched: Consolidate account_system_vtime extern declaration	Venkatesh Pallipadi
	Just a minor cleanup patch that makes things easier to the following patches. No functionality change in this patch. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18	sched: Fix softirq time accounting	Venkatesh Pallipadi
	Peter Zijlstra found a bug in the way softirq time is accounted in VIRT_CPU_ACCOUNTING on this thread: http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html The problem is, softirq processing uses local_bh_disable internally. There is no way, later in the flow, to differentiate between whether softirq is being processed or is it just that bh has been disabled. So, a hardirq when bh is disabled results in time being wrongly accounted as softirq. Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING as well. As account_system_time() in normal tick based accouting also uses softirq_count, which will be set even when not in softirq with bh disabled. Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq processing. The patch below does that and adds API in_serving_softirq() which returns whether we are currently processing softirq or not. Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c to in_serving_softirq. Looks like many usages of in_softirq really want in_serving_softirq. Those changes can be made individually on a case by case basis. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18	sched: Drop group_capacity to 1 only if local group has extra capacity	Nikhil Rao
	When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1 only if the local group has extra capacity. The extra check prevents the case where you always pull from the heaviest group when it is already under-utilized (possible with a large weight task outweighs the tasks on the system). For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task, and each task is running on one core. In this case, we observe the following events when balancing at the NUMA domain: - find_busiest_group() will always pick the sched group containing the niced task to be the busiest group. - find_busiest_queue() will then always pick one of the cpus running the nice0 task (never picks the cpu with the nice -15 task since weighted_cpuload > imbalance). - The load balancer fails to migrate the task since it is the running task and increments sd->nr_balance_failed. - It repeats the above steps a few more times until sd->nr_balance_failed > 5, at which point it kicks off the active load balancer, wakes up the migration thread and kicks the nice 0 task off the cpu. The load balancer doesn't stop until we kick out all nice 0 tasks from the sched group, leaving you with 3 idle cpus and one cpu running the nice -15 task. When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child domain (in this case MC) has SD_PREFER_SIBLING set. Subsequent load checks are not relevant because the niced task has a very large weight. In this patch, we add an extra condition to the "if(prefer_sibling)" check in update_sd_lb_stats(). We drop the capacity of a group only if the local group has extra capacity, ie. nr_running < group_capacity. This patch preserves the original intent of the prefer_siblings check (to spread tasks across the system in low utilization scenarios) and fixes the case above. It helps in the following ways: - In low utilization cases (where nr_tasks << nr_cpus), we still drop group_capacity down to 1 if we prefer siblings. - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most likely be > sgs.group_capacity. - When balancing large weight tasks, if the local group does not have extra capacity, we do not pick the group with the niced task as the busiest group. This prevents failed balances, active migration and the under-utilization described above. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18	sched: Force balancing on newidle balance if local group has capacity	Nikhil Rao
	This patch forces a load balance on a newly idle cpu when the local group has extra capacity and the busiest group does not have any. It improves system utilization when balancing tasks with a large weight differential. Under certain situations, such as a niced down task (i.e. nice = -15) in the presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and kicks away other tasks because of its large weight. This leads to sub-optimal utilization of the machine. Even though the sched group has capacity, it does not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL. With this patch, if the local group has extra capacity, we shortcut the checks in f_b_g() and try to pull a task over. A sched group has extra capacity if the group capacity is greater than the number of running tasks in that group. Thanks to Mike Galbraith for discussions leading to this patch and for the insight to reuse SD_NEWIDLE_BALANCE. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18	sched: Set group_imb only a task can be pulled from the busiest cpu	Nikhil Rao
	When cycling through sched groups to determine the busiest group, set group_imb only if the busiest cpu has more than 1 runnable task. This patch fixes the case where two cpus in a group have one runnable task each, but there is a large weight differential between these two tasks. The load balancer is unable to migrate any task from this group, and hence do not consider this group to be imbalanced. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com> [ small code readability edits ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18	sched: Do not consider SCHED_IDLE tasks to be cache hot	Nikhil Rao
	This patch adds a check in task_hot to return if the task has SCHED_IDLE policy. SCHED_IDLE tasks have very low weight, and when run with regular workloads, are typically scheduled many milliseconds apart. There is no need to consider these tasks hot for load balancing. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18	jump_label: Add COND_STMT(), reducer wrappery	Peter Zijlstra
	The use of the JUMP_LABEL() construct ends up creating endless silly wrappers, create a higher level construct to reduce this clutter. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18	perf: Optimize sw events	Peter Zijlstra
	Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>