summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2015-06-02Merge branch 'linus' into sched/core, to resolve conflictIngo Molnar
Conflicts: arch/sparc/include/asm/topology_64.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-27Merge branch 'perf/urgent' into perf/core, before applying dependent patchesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-22Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Three small fixes that have been picked up the last few weeks. Specifically: - Fix a memory corruption issue in NVMe with malignant user constructed request. From Christoph. - Kill (now) unused blk_queue_bio(), dm was changed to not need this anymore. From Mike Snitzer. - Always use blk_schedule_flush_plug() from the io_schedule() path when flushing a plug, fixing a !TASK_RUNNING warning with md. From Shaohua" * 'for-linus' of git://git.kernel.dk/linux-block: sched: always use blk_schedule_flush_plug in io_schedule_out nvme: fix kernel memory corruption with short INQUIRY buffers block: remove export for blk_queue_bio
2015-05-19Merge branch 'linus' into timers/coreThomas Gleixner
Make sure the upstream fixes are applied before adding further modifications.
2015-05-19sched/numa: Reduce conflict between fbq_classify_rq() and migrationRik van Riel
It is possible for fbq_classify_rq() to indicate that a CPU has tasks that should be moved to another NUMA node, but for migrate_improves_locality and migrate_degrades_locality to not identify those tasks. This patch always gives preference to preferred node evaluations, and only checks the number of faults when evaluating moves between two non-preferred nodes on a larger NUMA system. On a two node system, the number of faults is never evaluated. Either a task is about to be pulled off its preferred node, or migrated onto it. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: mgorman@suse.de Link: http://lkml.kernel.org/r/20150514225936.35b91717@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19sched/preempt: Optimize preemption operations on __schedule() callersFrederic Weisbecker
__schedule() disables preemption and some of its callers (the preempt_schedule*() family) also set PREEMPT_ACTIVE. So we have two preempt_count() modifications that could be performed at once. Lets remove the preemption disablement from __schedule() and pull this responsibility to its callers in order to optimize preempt_count() operations in a single place. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1431441711-29753-5-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19locking/arch: Rename set_mb() to smp_store_mb()Peter Zijlstra
Since set_mb() is really about an smp_mb() -- not a IO/DMA barrier like mb() rename it to match the recent smp_load_acquire() and smp_store_release(). Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-18sched: always use blk_schedule_flush_plug in io_schedule_outShaohua Li
block plug callback could sleep, so we introduce a parameter 'from_schedule' and corresponding drivers can use it to destinguish a schedule plug flush or a plug finish. Unfortunately io_schedule_out still uses blk_flush_plug(). This causes below output (Note, I added a might_sleep() in raid1_unplug to make it trigger faster, but the whole thing doesn't matter if I add might_sleep). In raid1/10, this can cause deadlock. This patch makes io_schedule_out always uses blk_schedule_flush_plug. This should only impact drivers (as far as I know, raid 1/10) which are sensitive to the 'from_schedule' parameter. [ 370.817949] ------------[ cut here ]------------ [ 370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90() [ 370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff81092fcf>] prepare_to_wait+0x2f/0x90 [ 370.817971] Modules linked in: raid1 [ 370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G W 4.0.0+ #361 [ 370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014 [ 370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1) [ 370.817985] ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001 [ 370.817988] ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8 [ 370.817990] ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28 [ 370.817993] Call Trace: [ 370.817999] [<ffffffff819dd7af>] dump_stack+0x4f/0x7b [ 370.818002] [<ffffffff81051afc>] warn_slowpath_common+0x8c/0xd0 [ 370.818004] [<ffffffff81051b86>] warn_slowpath_fmt+0x46/0x50 [ 370.818006] [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90 [ 370.818008] [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90 [ 370.818010] [<ffffffff810776ef>] __might_sleep+0x7f/0x90 [ 370.818014] [<ffffffffa0000c03>] raid1_unplug+0xd3/0x170 [raid1] [ 370.818024] [<ffffffff81421d9a>] blk_flush_plug_list+0x8a/0x1e0 [ 370.818028] [<ffffffff819e3550>] ? bit_wait+0x50/0x50 [ 370.818031] [<ffffffff819e21b0>] io_schedule_timeout+0x130/0x140 [ 370.818033] [<ffffffff819e3586>] bit_wait_io+0x36/0x50 [ 370.818034] [<ffffffff819e31b5>] __wait_on_bit+0x65/0x90 [ 370.818041] [<ffffffff8125b67c>] ? ext4_read_block_bitmap_nowait+0xbc/0x630 [ 370.818043] [<ffffffff819e3550>] ? bit_wait+0x50/0x50 [ 370.818045] [<ffffffff819e3302>] out_of_line_wait_on_bit+0x72/0x80 [ 370.818047] [<ffffffff810935e0>] ? autoremove_wake_function+0x40/0x40 [ 370.818050] [<ffffffff811de744>] __wait_on_buffer+0x44/0x50 [ 370.818053] [<ffffffff8125ae80>] ext4_wait_block_bitmap+0xe0/0xf0 [ 370.818058] [<ffffffff812975d6>] ext4_mb_init_cache+0x206/0x790 [ 370.818062] [<ffffffff8114bc6c>] ? lru_cache_add+0x1c/0x50 [ 370.818064] [<ffffffff81297c7e>] ext4_mb_init_group+0x11e/0x200 [ 370.818066] [<ffffffff81298231>] ext4_mb_load_buddy+0x341/0x360 [ 370.818068] [<ffffffff8129a1a3>] ext4_mb_find_by_goal+0x93/0x2f0 [ 370.818070] [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0 [ 370.818072] [<ffffffff8129ab67>] ext4_mb_regular_allocator+0x67/0x460 [ 370.818074] [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0 [ 370.818076] [<ffffffff8129ca4b>] ext4_mb_new_blocks+0x4cb/0x620 [ 370.818079] [<ffffffff81290956>] ext4_ext_map_blocks+0x4c6/0x14d0 [ 370.818081] [<ffffffff812a4d4e>] ? ext4_es_lookup_extent+0x4e/0x290 [ 370.818085] [<ffffffff8126399d>] ext4_map_blocks+0x14d/0x4f0 [ 370.818088] [<ffffffff81266fbd>] ext4_writepages+0x76d/0xe50 [ 370.818094] [<ffffffff81149691>] do_writepages+0x21/0x50 [ 370.818097] [<ffffffff811d5c00>] __writeback_single_inode+0x60/0x490 [ 370.818099] [<ffffffff811d630a>] writeback_sb_inodes+0x2da/0x590 [ 370.818103] [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50 [ 370.818105] [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50 [ 370.818107] [<ffffffff811d665f>] __writeback_inodes_wb+0x9f/0xd0 [ 370.818109] [<ffffffff811d69db>] wb_writeback+0x34b/0x3c0 [ 370.818111] [<ffffffff811d70df>] bdi_writeback_workfn+0x23f/0x550 [ 370.818116] [<ffffffff8106bbd8>] process_one_work+0x1c8/0x570 [ 370.818117] [<ffffffff8106bb5b>] ? process_one_work+0x14b/0x570 [ 370.818119] [<ffffffff8106c09b>] worker_thread+0x11b/0x470 [ 370.818121] [<ffffffff8106bf80>] ? process_one_work+0x570/0x570 [ 370.818124] [<ffffffff81071868>] kthread+0xf8/0x110 [ 370.818126] [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210 [ 370.818129] [<ffffffff819e9322>] ret_from_fork+0x42/0x70 [ 370.818131] [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210 [ 370.818132] ---[ end trace 7b4deb71e68b6605 ]--- V2: don't change ->in_iowait Cc: NeilBrown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-18sched,perf: Fix periodic timersPeter Zijlstra
In the below two commits (see Fixes) we have periodic timers that can stop themselves when they're no longer required, but need to be (re)-started when their idle condition changes. Further complications is that we want the timer handler to always do the forward such that it will always correctly deal with the overruns, and we do not want to race such that the handler has already decided to stop, but the (external) restart sees the timer still active and we end up with a 'lost' timer. The problem with the current code is that the re-start can come before the callback does the forward, at which point the forward from the callback will WARN about forwarding an enqueued timer. Now, conceptually its easy to detect if you're before or after the fwd by comparing the expiration time against the current time. Of course, that's expensive (and racy) because we don't have the current time. Alternatively one could cache this state inside the timer, but then everybody pays the overhead of maintaining this extra state, and that is undesired. The only other option that I could see is the external timer_active variable, which I tried to kill before. I would love a nicer interface for this seemingly simple 'problem' but alas. Fixes: 272325c4821f ("perf: Fix mux_interval hrtimer wreckage") Fixes: 77a4d1a1b9a1 ("sched: Cleanup bandwidth timers") Cc: pjt@google.com Cc: tglx@linutronix.de Cc: klamm@yandex-team.ru Cc: mingo@kernel.org Cc: bsegall@google.com Cc: hpa@zytor.com Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
2015-05-17sched: Fix function declaration return type mismatchNicholas Mc Guire
static code checking was unhappy with: ./kernel/sched/fair.c:162 WARNING: return of wrong type int != unsigned int get_update_sysctl_factor() is declared to return int but is currently returning an unsigned int. The first few preprocessed lines are: static int get_update_sysctl_factor(void) { unsigned int cpus = ({ int __min1 = (cpumask_weight(cpu_online_mask)); int __min2 = (8); __min1 < __min2 ? __min1: __min2; }); unsigned int factor; The type used by min_t() should be 'unsigned int' and the return type of get_update_sysctl_factor() should also be 'unsigned int' as its call-site update_sysctl() is expecting 'unsigned int' and the values utilizing: 'factor' 'sysctl_sched_min_granularity' 'sched_nr_latency' 'sysctl_sched_wakeup_granularity' ... are also all 'unsigned int', plus cpumask_weight() is also returning 'unsigned int'. So the natural type to use around here is 'unsigned int'. ( Patch was compile tested with x86_64_defconfig + CONFIG_SCHED_DEBUG=y and the changed sections in kernel/sched/fair.i were reviewed. ) Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> [ Improved the changelog a bit. ] Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1431716742-11077-1-git-send-email-hofrat@osadl.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-14sched / idle: Call default_idle_call() from cpuidle_enter_state()Rafael J. Wysocki
The check of the cpuidle_enter() return value against -EBUSY made in call_cpuidle() will not be necessary any more if cpuidle_enter_state() calls default_idle_call() directly when it is about to return -EBUSY, so make that happen and eliminate the check. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Tested-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Tested-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Kevin Hilman <khilman@linaro.org>
2015-05-14sched / idle: Call idle_set_state() from cpuidle_enter_state()Rafael J. Wysocki
Introduce a wrapper function around idle_set_state() called sched_idle_set_state() that will pass this_rq() to it as the first argument and make cpuidle_enter_state() call the new function before and after entering the target state. At the same time, remove direct invocations of idle_set_state() from call_cpuidle(). This will allow the invocation of default_idle_call() to be moved from call_cpuidle() to cpuidle_enter_state() safely and call_cpuidle() to be simplified a bit as a result. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Tested-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Tested-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Kevin Hilman <khilman@linaro.org>
2015-05-08perf: Fix software migrate eventsPeter Zijlstra
Stephane asked about PERF_COUNT_SW_CPU_MIGRATIONS and I realized it was borken: > The problem is that the task isn't actually scheduled while its being > migrated (obviously), and if its not scheduled, the counters aren't > scheduled either, so there's no observing of the fact. > > A further problem with migrations is that many migrations happen from > softirq context, which is nested inside the 'random' task context of > whoemever happens to run at that time, similarly for the wakeup > migrations triggered from (soft)irq context. All those end up being > accounted in the task that's currently running, eg. your 'ls'. The below cures this by marking a task as migrated and accounting it on the subsequent sched_in(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08sched: Implement lockless wake-queuesPeter Zijlstra
This is useful for locking primitives that can effect multiple wakeups per operation and want to avoid lock internal lock contention by delaying the wakeups until we've released the lock internal locks. Alternatively it can be used to avoid issuing multiple wakeups, and thus save a few cycles, in packet processing. Queue all target tasks and wakeup once you've processed all packets. That way you avoid waking the target task multiple times if there were multiple packets for the same task. Properties of a wake_q are: - Lockless, as queue head must reside on the stack. - Being a queue, maintains wakeup order passed by the callers. This can be important for otherwise, in scenarios where highly contended locks could affect any reliance on lock fairness. - A queued task cannot be added again until it is woken up. This patch adds the needed infrastructure into the scheduler code and uses the new wake_list to delay the futex wakeups until after we've released the hash bucket locks. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [tweaks, adjustments, comments, etc.] Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Chris Mason <clm@fb.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: George Spelvin <linux@horizon.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1430494072-30283-2-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08sched, timer: Use the atomic task_cputime in thread_group_cputimerJason Low
Recent optimizations were made to thread_group_cputimer to improve its scalability by keeping track of cputime stats without a lock. However, the values were open coded to the structure, causing them to be at a different abstraction level from the regular task_cputime structure. Furthermore, any subsequent similar optimizations would not be able to share the new code, since they are specific to thread_group_cputimer. This patch adds the new task_cputime_atomic data structure (introduced in the previous patch in the series) to thread_group_cputimer for keeping track of the cputime atomically, which also helps generalize the code. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Waiman Long <Waiman.Long@hp.com> Link: http://lkml.kernel.org/r/1430251224-5764-6-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08sched, timer: Replace spinlocks with atomics in thread_group_cputimer(), to ↵Jason Low
improve scalability While running a database workload, we found a scalability issue with itimers. Much of the problem was caused by the thread_group_cputimer spinlock. Each time we account for group system/user time, we need to obtain a thread_group_cputimer's spinlock to update the timers. On larger systems (such as a 16 socket machine), this caused more than 30% of total time spent trying to obtain this kernel lock to update these group timer stats. This patch converts the timers to 64-bit atomic variables and use atomic add to update them without a lock. With this patch, the percent of total time spent updating thread group cputimer timers was reduced from 30% down to less than 1%. Note: On 32-bit systems using the generic 64-bit atomics, this causes sample_group_cputimer() to take locks 3 times instead of just 1 time. However, we tested this patch on a 32-bit system ARM system using the generic atomics and did not find the overhead to be much of an issue. An explanation for why this isn't an issue is that 32-bit systems usually have small numbers of CPUs, and cacheline contention from extra spinlocks called periodically is not really apparent on smaller systems. Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Waiman Long <Waiman.Long@hp.com> Link: http://lkml.kernel.org/r/1430251224-5764-4-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08sched/numa: Document usages of mm->numa_scan_seqJason Low
The p->mm->numa_scan_seq is accessed using READ_ONCE/WRITE_ONCE and modified without exclusive access. It is not clear why it is accessed this way. This patch provides some documentation on that. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Waiman Long <waiman.long@hp.com> Link: http://lkml.kernel.org/r/1430440094.2475.61.camel@j-VirtualBox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to ↵Jason Low
READ_ONCE()/WRITE_ONCE() ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes the rest of the existing usages of ACCESS_ONCE() in the scheduler, and use the new READ_ONCE() and WRITE_ONCE() APIs as appropriate. Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Waiman Long <Waiman.Long@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08sched/core: Remove unnecessary down/up conversionNicholas Mc Guire
'rt_period_us' is automatically type converted from u64 to long and then cast back to u64 - this down/up conversion is unnecessary and can be removed to improve readability. This will also help us not truncate 'rt_period_us' to 32 bits on 32-bit kernels, should we ever have so large values. (unlikely, not the least due to procfs.) Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1430643116-24049-1-git-send-email-hofrat@osadl.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08sched: Move the loadavg code to a more obvious locationPeter Zijlstra
I could not find the loadavg code.. turns out it was hidden in a file called proc.c. It further got mingled up with the cruft per rq load indexes (which we really want to get rid of). Move the per rq load indexes into the fair.c load-balance code (that's the only thing that uses them) and rename proc.c to loadavg.c so we can find it again. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> [ Did minor cleanups to the code. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08Merge branch 'sched/urgent' into sched/core, before applying new patchesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08sched/core: Remove __cpuinit section tag that crept back inPaul Gortmaker
We removed __cpuinit support (leaving no-op stubs) quite some time ago. However this one crept back in as of commit a803f0261bb2bb57aab ("sched: Initialize rq->age_stamp on processor start") Since we want to clobber the stubs too, get this removed now. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Corey Minyard <cminyard@mvista.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1430174880-27958-2-git-send-email-paul.gortmaker@windriver.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08sched/core: Fix regression in cpuset_cpu_inactive() for suspendOmar Sandoval
Commit 3c18d447b3b3 ("sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()"), a SCHED_DEADLINE bugfix, had a logic error that caused a regression in setting a CPU inactive during suspend. I ran into this when a program was failing pthread_setaffinity_np() with EINVAL after a suspend+wake up. A simple reproducer: $ ./a.out sched_setaffinity: Success $ systemctl suspend $ ./a.out sched_setaffinity: Invalid argument ... where ./a.out is: #define _GNU_SOURCE #include <errno.h> #include <sched.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> int main(void) { long num_cores; cpu_set_t cpu_set; int ret; num_cores = sysconf(_SC_NPROCESSORS_ONLN); CPU_ZERO(&cpu_set); CPU_SET(num_cores - 1, &cpu_set); errno = 0; ret = sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set); perror("sched_setaffinity"); return ret ? EXIT_FAILURE : EXIT_SUCCESS; } The mistake is that suspend is handled in the action == CPU_DOWN_PREPARE_FROZEN case of the switch statement in cpuset_cpu_inactive(). However, the commit in question masked out CPU_TASKS_FROZEN from the action, making this case dead. The fix is straightforward. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 3c18d447b3b3 ("sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()") Link: http://lkml.kernel.org/r/1cb5ecb3d6543c38cce5790387f336f54ec8e2bc.1430733960.git.osandov@osandov.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08sched: Handle priority boosted tasks proper in setscheduler()Thomas Gleixner
Ronny reported that the following scenario is not handled correctly: T1 (prio = 10) lock(rtmutex); T2 (prio = 20) lock(rtmutex) boost T1 T1 (prio = 20) sys_set_scheduler(prio = 30) T1 prio = 30 .... sys_set_scheduler(prio = 10) T1 prio = 30 The last step is wrong as T1 should now be back at prio 20. Commit c365c292d059 ("sched: Consider pi boosting in setscheduler()") only handles the case where a boosted tasks tries to lower its priority. Fix it by taking the new effective priority into account for the decision whether a change of the priority is required. Reported-by: Ronny Meeus <ronny.meeus@gmail.com> Tested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Fixes: c365c292d059 ("sched: Consider pi boosting in setscheduler()") Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505051806060.4225@nanos Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08Merge branch 'sched/urgent' into sched/coreIngo Molnar
So this isn't really a fix but a cleanup that can wait for v4.2. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-07nohz: Set isolcpus when nohz_full is setChris Metcalf
nohz_full is only useful with isolcpus are also set, since otherwise the scheduler has to run periodically to try to determine whether to steal work from other cores. Accordingly, when booting with nohz_full=xxx on the command line, we should act as if isolcpus=xxx was also set, and set (or extend) the isolcpus set to include the nohz_full cpus. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mike Galbraith <umgwanakikbuti@gmail.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Jones <davej@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1430928266-24888-5-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-07context_tracking: Inherit TIF_NOHZ through forks instead of context switchesFrederic Weisbecker
TIF_NOHZ is used by context_tracking to force syscall slow-path on every task in order to track userspace roundtrips. As such, it must be set on all running tasks. It's currently explicitly inherited through context switches. There is no need to do it in this fast-path though. The flag could simply be set once for all on all tasks, whether they are running or not. Lets do this by setting the flag for the init task on early boot, and let it propagate through fork inheritance. While at it, mark context_tracking_cpu_set() as init code, we only need it at early boot time. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Dave Jones <davej@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1430928266-24888-3-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-04sched / idle: Eliminate the "reflect" check from cpuidle_idle_call()Rafael J. Wysocki
Since cpuidle_reflect() should only be called if the idle state to enter was selected by cpuidle_select(), there is the "reflect" variable in cpuidle_idle_call() whose value is used to determine whether or not that is the case. However, if the entire code run between the conditional setting "reflect" and the call to cpuidle_reflect() is moved to a separate function, it will be possible to call that new function in both branches of the conditional, in which case cpuidle_reflect() will only need to be called from one of them too and the "reflect" variable won't be necessary any more. This eliminates one check made by cpuidle_idle_call() on the majority of its invocations, so change the code as described. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2015-05-04sched / idle: Move the default idle call code to a separate functionRafael J. Wysocki
Move the code under the "use_default" label in cpuidle_idle_call() into a separate (new) function. This just allows the subsequent changes to be more stratightforward. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2015-04-30Merge tag 'pm+acpi-4.1-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management and ACPI fixes from Rafael Wysocki: "Three regression fixes this time, one for a recent regression in the cpuidle core affecting multiple systems, one for an inadvertently added duplicate typedef in ACPICA that breaks compilation with GCC 4.5 and one for an ACPI Smart Battery Subsystem driver regression introduced during the 3.18 cycle (stable-candidate). Specifics: - Fix for a regression in the cpuidle core introduced by one of the recent commits in the clockevents_notify() removal series that put a call to a function which had to be executed with disabled interrupts into a code path running with enabled interrupts (Rafael J Wysocki) - Fix for a build problem in ACPICA (with GCC 4.5) introduced by one of the recent ACPICA tools commits that added a duplicate typedef to one of the ACPICA's header files by mistake (Olaf Hering) - Fix for a regression in the ACPI SBS (Smart Battery Subsystem) driver introduced during the 3.18 development cycle causing the smart battery manager to be marked as not present when it should be marked as present (Chris Bainbridge)" * tag 'pm+acpi-4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpuidle: Run tick_broadcast_exit() with disabled interrupts ACPI / SBS: Enable battery manager when present ACPICA: remove duplicate u8 typedef
2015-04-29cpuidle: Run tick_broadcast_exit() with disabled interruptsRafael J. Wysocki
Commit 335f49196fd6 (sched/idle: Use explicit broadcast oneshot control function) replaced clockevents_notify() invocations in cpuidle_idle_call() with direct calls to tick_broadcast_enter() and tick_broadcast_exit(), but it overlooked the fact that interrupts were already enabled before calling the latter which led to functional breakage on systems using idle states with the CPUIDLE_FLAG_TIMER_STOP flag set. Fix that by moving the invocations of tick_broadcast_enter() and tick_broadcast_exit() down into cpuidle_enter_state() where interrupts are still disabled when tick_broadcast_exit() is called. Also ensure that interrupts will be disabled before running tick_broadcast_exit() even if they have been enabled by the idle state's ->enter callback. Trigger a WARN_ON_ONCE() in that case, as we generally don't want that to happen for states with CPUIDLE_FLAG_TIMER_STOP set. Fixes: 335f49196fd6 (sched/idle: Use explicit broadcast oneshot control function) Reported-and-tested-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-04-27x86: pvclock: Really remove the sched notifier for cross-cpu migrationsPaolo Bonzini
This reverts commits 0a4e6be9ca17c54817cf814b4b5aa60478c6df27 and 80f7fdb1c7f0f9266421f823964fd1962681f6ce. The task migration notifier was originally introduced in order to support the pvclock vsyscall with non-synchronized TSC, but KVM only supports it with synchronized TSC. Hence, on KVM the race condition is only needed due to a bad implementation on the host side, and even then it's so rare that it's mostly theoretical. As far as KVM is concerned it's possible to fix the host, avoiding the additional complexity in the vDSO and the (re)introduction of the task migration notifier. Xen, on the other hand, hasn't yet implemented vsyscall support at all, so we do not care about its plans for non-synchronized TSC. Reported-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-23sched: debug: Remove the cfs bandwidth timer_active printoutThomas Gleixner
The struct member is gone. Reported-by: fengguang.wu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
2015-04-22sched: Cleanup bandwidth timersPeter Zijlstra
Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth(). The more I look at that code the more I'm convinced its crack, that entire __start_cfs_bandwidth() thing is brain melting, we don't need to cancel a timer before starting it, *hrtimer_start*() will happily remove the timer for you if its still enqueued. Removing that, removes a big part of the problem, no more ugly cancel loop to get stuck in. So now, if I understand things right, the entire reason you have this cfs_b->lock guarded ->timer_active nonsense is to make sure we don't accidentally lose the timer. It appears to me that it should be possible to guarantee that same by unconditionally (re)starting the timer when !queued. Because regardless what hrtimer::function will return, if we beat it to (re)enqueue the timer, it doesn't matter. Now, because hrtimers don't come with any serialization guarantees we must ensure both handler and (re)start loop serialize their access to the hrtimer to avoid both trying to forward the timer at the same time. Update the rt bandwidth timer to match. This effectively reverts: 09dc4ab03936 ("sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock"). Reported-by: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/20150415095011.804589208@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22sched: deadline: Use hrtimer_start()Thomas Gleixner
hrtimer_start() does not longer defer already expired timers to the softirq. Get rid of the __hrtimer_start_range_ns() invocation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20150414203502.627353666@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22sched: core: Use hrtimer_start[_expires]()Thomas Gleixner
hrtimer_start() now enforces a timer interrupt when an already expired timer is enqueued. Get rid of the __hrtimer_start_range_ns() invocations and the loops around it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20150414203502.531131739@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-16sched/autogroup: Remove unnecessary #ifdef guardsTobias Klauser
Since commit: 029632fbb7b7 ("sched: Make separate sched*.c translation units") autogroup is a separate translation unit built from the Makefile and thus no longer needs its content wrapped with #ifdef CONFIG_SCHED_AUTOGROUP. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1429116678-17000-1-git-send-email-tklauser@distanz.ch Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-14Merge branch 'timers-nohz-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull NOHZ changes from Ingo Molnar: "This tree adds full dynticks support to KVM guests (support the disabling of the timer tick on the guest). The main missing piece was the recognition of guest execution as RCU extended quiescent state and related changes" * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest context_tracking: Export context_tracking_user_enter/exit context_tracking: Run vtime_user_enter/exit only when state == CONTEXT_USER context_tracking: Add stub context_tracking_is_enabled context_tracking: Generalize context tracking APIs to support user and guest context_tracking: Rename context symbols to prepare for transition state ppc: Remove unused cpp symbols in kvm headers
2015-04-14Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU changes from Ingo Molnar: "The main changes in this cycle were: - changes permitting use of call_rcu() and friends very early in boot, for example, before rcu_init() is invoked. - add in-kernel API to enable and disable expediting of normal RCU grace periods. - improve RCU's handling of (hotplug-) outgoing CPUs. - NO_HZ_FULL_SYSIDLE fixes. - tiny-RCU updates to make it more tiny. - documentation updates. - miscellaneous fixes" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits) cpu: Provide smpboot_thread_init() on !CONFIG_SMP kernels as well cpu: Defer smpboot kthread unparking until CPU known to scheduler rcu: Associate quiescent-state reports with grace period rcu: Yet another fix for preemption and CPU hotplug rcu: Add diagnostics to grace-period cleanup rcutorture: Default to grace-period-initialization delays rcu: Handle outgoing CPUs on exit from idle loop cpu: Make CPU-offline idle-loop transition point more precise rcu: Eliminate ->onoff_mutex from rcu_node structure rcu: Process offlining and onlining only at grace-period start rcu: Move rcu_report_unblock_qs_rnp() to common code rcu: Rework preemptible expedited bitmask handling rcu: Remove event tracing from rcu_cpu_notify(), used by offline CPUs rcutorture: Enable slow grace-period initializations rcu: Provide diagnostic option to slow down grace-period initialization rcu: Detect stalls caused by failure to propagate up rcu_node tree rcu: Eliminate empty HOTPLUG_CPU ifdef rcu: Simplify sync_rcu_preempt_exp_init() rcu: Put all orphan-callback-related code under same comment rcu: Consolidate offline-CPU callback initialization ...
2015-04-13Merge branch 'for-4.1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Nothing too interesting. Rik made cpuset cooperate better with isolcpus and there are several other cleanup patches" * 'for-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset, isolcpus: document relationship between cpusets & isolcpus cpusets, isolcpus: exclude isolcpus from load balancing in cpusets sched, isolcpu: make cpu_isolated_map visible outside scheduler cpuset: initialize cpuset a bit early cgroup: Use kvfree in pidlist_free() cgroup: call cgroup_subsys->bind on cgroup subsys initialization
2015-04-13Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Ingo Molnar: "The main changes in this cycle were: - clockevents state machine cleanups and enhancements (Viresh Kumar) - clockevents broadcast notifier horror to state machine conversion and related cleanups (Thomas Gleixner, Rafael J Wysocki) - clocksource and timekeeping core updates (John Stultz) - clocksource driver updates and fixes (Ben Dooks, Dmitry Osipenko, Hans de Goede, Laurent Pinchart, Maxime Ripard, Xunlei Pang) - y2038 fixes (Xunlei Pang, John Stultz) - NMI-safe ktime_get_raw_fast() and general refactoring of the clock code, in preparation to perf's per event clock ID support (Peter Zijlstra) - generic sched/clock fixes, optimizations and cleanups (Daniel Thompson) - clockevents cpu_down() race fix (Preeti U Murthy)" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (94 commits) timers/PM: Drop unnecessary braces from tick_freeze() timers/PM: Fix up tick_unfreeze() timekeeping: Get rid of stale comment clockevents: Cleanup dead cpu explicitely clockevents: Make tick handover explicit clockevents: Remove broadcast oneshot control leftovers sched/idle: Use explicit broadcast oneshot control function ARM: Tegra: Use explicit broadcast oneshot control function ARM: OMAP: Use explicit broadcast oneshot control function intel_idle: Use explicit broadcast oneshot control function ACPI/idle: Use explicit broadcast control function ACPI/PAD: Use explicit broadcast oneshot control function x86/amd/idle, clockevents: Use explicit broadcast oneshot control functions clockevents: Provide explicit broadcast oneshot control functions clockevents: Remove the broadcast control leftovers ARM: OMAP: Use explicit broadcast control function intel_idle: Use explicit broadcast control function cpuidle: Use explicit broadcast control function ACPI/processor: Use explicit broadcast control function ACPI/PAD: Use explicit broadcast control function ...
2015-04-13Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "Major changes: - Reworked CPU capacity code, for better SMP load balancing on systems with assymetric CPUs. (Vincent Guittot, Morten Rasmussen) - Reworked RT task SMP balancing to be push based instead of pull based, to reduce latencies on large CPU count systems. (Steven Rostedt) - SCHED_DEADLINE support updates and fixes. (Juri Lelli) - SCHED_DEADLINE task migration support during CPU hotplug. (Wanpeng Li) - x86 mwait-idle optimizations and fixes. (Mike Galbraith, Len Brown) - sched/numa improvements. (Rik van Riel) - various cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits) sched/core: Drop debugging leftover trace_printk call sched/deadline: Support DL task migration during CPU hotplug sched/core: Check for available DL bandwidth in cpuset_cpu_inactive() sched/deadline: Always enqueue on previous rq when dl_task_timer() fires sched/core: Remove unused argument from init_[rt|dl]_rq() sched/deadline: Fix rt runtime corruption when dl fails its global constraints sched/deadline: Avoid a superfluous check sched: Improve load balancing in the presence of idle CPUs sched: Optimize freq invariant accounting sched: Move CFS tasks to CPUs with higher capacity sched: Add SD_PREFER_SIBLING for SMT level sched: Remove unused struct sched_group_capacity::capacity_orig sched: Replace capacity_factor by usage sched: Calculate CPU's usage statistic and put it into struct sg_lb_stats::group_usage sched: Add struct rq::cpu_capacity_orig sched: Make scale_rt invariant with frequency sched: Make sched entity usage tracking scale-invariant sched: Remove frequency scaling from cpu_capacity sched: Track group sched_entity usage contributions sched: Add sched_avg::utilization_avg_contrib ...
2015-04-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "First batch of KVM changes for 4.1 The most interesting bit here is irqfd/ioeventfd support for ARM and ARM64. Summary: ARM/ARM64: fixes for live migration, irqfd and ioeventfd support (enabling vhost, too), page aging s390: interrupt handling rework, allowing to inject all local interrupts via new ioctl and to get/set the full local irq state for migration and introspection. New ioctls to access memory by virtual address, and to get/set the guest storage keys. SIMD support. MIPS: FPU and MIPS SIMD Architecture (MSA) support. Includes some patches from Ralf Baechle's MIPS tree. x86: bugfixes (notably for pvclock, the others are small) and cleanups. Another small latency improvement for the TSC deadline timer" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits) KVM: use slowpath for cross page cached accesses kvm: mmu: lazy collapse small sptes into large sptes KVM: x86: Clear CR2 on VCPU reset KVM: x86: DR0-DR3 are not clear on reset KVM: x86: BSP in MSR_IA32_APICBASE is writable KVM: x86: simplify kvm_apic_map KVM: x86: avoid logical_map when it is invalid KVM: x86: fix mixed APIC mode broadcast KVM: x86: use MDA for interrupt matching kvm/ppc/mpic: drop unused IRQ_testbit KVM: nVMX: remove unnecessary double caching of MAXPHYADDR KVM: nVMX: checks for address bits beyond MAXPHYADDR on VM-entry KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu KVM: vmx: pass error code with internal error #2 x86: vdso: fix pvclock races with task migration KVM: remove kvm_read_hva and kvm_read_hva_atomic KVM: x86: optimize delivery of TSC deadline timer interrupt KVM: x86: extract blocking logic from __vcpu_run kvm: x86: fix x86 eflags fixed bit KVM: s390: migrate vcpu interrupt state ...
2015-04-07mm: numa: disable change protection for vma(VM_HUGETLB)Naoya Horiguchi
Currently when a process accesses a hugetlb range protected with PROTNONE, unexpected COWs are triggered, which finally puts the hugetlb subsystem into a broken/uncontrollable state, where for example h->resv_huge_pages is subtracted too much and wraps around to a very large number, and the free hugepage pool is no longer maintainable. This patch simply stops changing protection for vma(VM_HUGETLB) to fix the problem. And this also allows us to avoid useless overhead of minor faults. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Suggested-by: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-03sched/core: Drop debugging leftover trace_printk callBorislav Petkov
Commit: 3c18d447b3b3 ("sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()") forgot a trace_printk() debugging piece in and Steve's banner screamed in dmesg. Remove it. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1428050570-21041-1-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03sched/idle: Use explicit broadcast oneshot control functionThomas Gleixner
Replace the clockevents_notify() call with an explicit function call. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/6422336.RMm7oUHcXh@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02sched/deadline: Support DL task migration during CPU hotplugWanpeng Li
I observed that DL tasks can't be migrated to other CPUs during CPU hotplug, in addition, task may/may not be running again if CPU is added back. The root cause which I found is that DL tasks will be throtted and removed from the DL rq after comsuming all their budget, which leads to the situation that stop task can't pick them up from the DL rq and migrate them to other CPUs during hotplug. The method to reproduce: schedtool -E -t 50000:100000 -e ./test Actually './test' is just a simple for loop. Then observe which CPU the test task is on and offline it: echo 0 > /sys/devices/system/cpu/cpuN/online This patch adds the DL task migration during CPU hotplug by finding a most suitable later deadline rq after DL timer fires if current rq is offline. If it fails to find a suitable later deadline rq then it falls back to any eligible online CPU in so that the deadline task will come back to us, and the push/pull mechanism should then move it around properly. Suggested-and-Acked-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1427411315-4298-1-git-send-email-wanpeng.li@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()Juri Lelli
Hotplug operations are destructive w.r.t. cpusets. In case such an operation is performed on a CPU belonging to an exlusive cpuset, the DL bandwidth information associated with the corresponding root domain is gone even if the operation fails (in sched_cpu_inactive()). For this reason we need to move the check we currently have in sched_cpu_inactive() to cpuset_cpu_inactive() to prevent useless cpusets reconfiguration in the CPU_DOWN_FAILED path. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Link: http://lkml.kernel.org/r/1427792017-7356-2-git-send-email-juri.lelli@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02sched/deadline: Always enqueue on previous rq when dl_task_timer() firesJuri Lelli
dl_task_timer() may fire on a different rq from where a task was removed after throttling. Since the call path is: dl_task_timer() -> enqueue_task_dl() -> enqueue_dl_entity() -> replenish_dl_entity() and replenish_dl_entity() uses dl_se's rq, we can't use current's rq in dl_task_timer(), but we need to lock the task's previous one. Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Kirill Tkhai <ktkhai@parallels.com> Cc: Juri Lelli <juri.lelli@gmail.com> Fixes: 3960c8c0c789 ("sched: Make dl_task_time() use task_rq_lock()") Link: http://lkml.kernel.org/r/1427792017-7356-1-git-send-email-juri.lelli@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02sched/core: Remove unused argument from init_[rt|dl]_rq()Abel Vesa
Obviously, 'rq' is not used in these two functions, therefore, there is no reason for it to be passed as an argument. Signed-off-by: Abel Vesa <abelvesa@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1425383427-26244-1-git-send-email-abelvesa@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>