summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2014-01-13sched/preempt: Take away preempt_enable_no_resched() from modulesPeter Zijlstra
Discourage drivers/modules to be creative with preemption. Sadly all is implemented in macros and inline so if they want to do evil they still can, but at least try and discourage some. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com> Cc: rui.zhang@intel.com Cc: jacob.jun.pan@linux.intel.com Cc: Mike Galbraith <bitbucket@online.de> Cc: hpa@zytor.com Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: lenb@kernel.org Cc: rjw@rjwysocki.net Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-fn7h6vu8wtgxk0ih402qcijx@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13locking: Optimize lock_bh functionsPeter Zijlstra
Currently all _bh_ lock functions do two preempt_count operations: local_bh_disable(); preempt_disable(); and for the unlock: preempt_enable_no_resched(); local_bh_enable(); Since its a waste of perfectly good cycles to modify the same variable twice when you can do it in one go; use the new __local_bh_{dis,en}able_ip() functions that allow us to provide a preempt_count value to add/sub. So define SOFTIRQ_LOCK_OFFSET as the offset a _bh_ lock needs to add/sub to be done in one go. As a bonus it gets rid of the preempt_enable_no_resched() usage. This reduces a 1000 loops of: spin_lock_bh(&bh_lock); spin_unlock_bh(&bh_lock); from 53596 cycles to 51995 cycles. I didn't do enough measurements to say for absolute sure that the result is significant but the the few runs I did for each suggest it is so. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: jacob.jun.pan@linux.intel.com Cc: Mike Galbraith <bitbucket@online.de> Cc: hpa@zytor.com Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: lenb@kernel.org Cc: rjw@rjwysocki.net Cc: rui.zhang@intel.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Factor out the on_null_domain() checks in trigger_load_balance()Daniel Lezcano
The test on_null_domain is done twice in the trigger_load_balance function. Move the test at the begin of the function, so there is only one check. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-9-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Pass 'struct rq' to nohz_idle_balance()Daniel Lezcano
The cpu information is stored in the struct rq. Pass the struct rq to nohz_idle_balance, so all the functions called in run_rebalance_domains have the same parameters and the 'this_cpu' variable becomes pointless. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> [ Added !SMP build fix. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-8-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Pass 'struct rq' to rebalance_domains()Daniel Lezcano
The cpu information is stored in the struct rq and the caller of the rebalance_domains function pass the cpu to retrieve the struct rq but it already has the struct rq info. Replace the cpu parameter with the struct rq. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-7-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Remove unused parameter from nohz_balancer_kick()Daniel Lezcano
The cpu parameter is no longer needed in nohz_balancer_kick, let's remove the parameter. Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-6-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Remove unused parameter from find_new_ilb()Daniel Lezcano
The 'call_cpu' is never used in the function. Remove it. Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-5-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Pass 'struct rq' to on_null_domain()Daniel Lezcano
The on_null_domain() function is getting the cpu to retrieve the struct rq associated with it. Pass 'struct rq' directly to the function as the caller already has the info. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-4-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Reduce nohz_kick_needed() parametersDaniel Lezcano
The cpu information is already stored in the struct rq, so no need to pass it as parameter to the nohz_kick_needed function. The caller of this function just called idle_cpu() before to fill the rq->idle_balance field. Use rq->cpu and rq->idle_balance. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-3-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Reduce trigger_load_balance() parametersDaniel Lezcano
The cpu information is already stored in the struct rq, so no need to pass it as parameter to the trigger_load_balance function. Cc: linaro-kernel@lists.linaro.org Cc: preeti.lkml@gmail.com Cc: mingo@redhat.com Cc: peterz@infradead.org Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-2-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: Fix hotplug admission controlPeter Zijlstra
The current hotplug admission control is broken because: CPU_DYING -> migration_call() -> migrate_tasks() -> __migrate_task() cannot fail and hard assumes it _will_ move all tasks off of the dying cpu, failing this will break hotplug. The much simpler solution is a DOWN_PREPARE handler that fails when removing one CPU gets us below the total allocated bandwidth. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131220171343.GL2480@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: Remove the sysctl_sched_dl knobsPeter Zijlstra
Remove the deadline specific sysctls for now. The problem with them is that the interaction with the exisiting rt knobs is nearly impossible to get right. The current (as per before this patch) situation is that the rt and dl bandwidth is completely separate and we enforce rt+dl < 100%. This is undesirable because this means that the rt default of 95% leaves us hardly any room, even though dl tasks are saver than rt tasks. Another proposed solution was (a discarted patch) to have the dl bandwidth be a fraction of the rt bandwidth. This is highly confusing imo. Furthermore neither proposal is consistent with the situation we actually want; which is rt tasks ran from a dl server. In which case the rt bandwidth is a direct subset of dl. So whichever way we go, the introduction of dl controls at this point is painful. Therefore remove them and instead share the rt budget. This means that for now the rt knobs are used for dl admission control and the dl runtime is accounted against the rt runtime. I realise that this isn't entirely desirable either; but whatever we do we appear to need to change the interface later, so better have a small interface for now. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-zpyqbqds1r0vyxtxza1e7rdc@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: Fix up the smp-affinity mask testsPeter Zijlstra
For now deadline tasks are not allowed to set smp affinity; however the current tests are wrong, cure this. The test in __sched_setscheduler() also uses an on-stack cpumask_t which is a no-no. Change both tests to use cpumask_subset() such that we test the root domain span to be a subset of the cpus_allowed mask. This way we're sure the tasks can always run on all CPUs they can be balanced over, and have no effective affinity constraints. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-fyqtb1lapxca3lhsxv9cumdc@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: speed up SCHED_DEADLINE pushes with a push-heapJuri Lelli
Data from tests confirmed that the original active load balancing logic didn't scale neither in the number of CPU nor in the number of tasks (as sched_rt does). Here we provide a global data structure to keep track of deadlines of the running tasks in the system. The structure is composed by a bitmask showing the free CPUs and a max-heap, needed when the system is heavily loaded. The implementation and concurrent access scheme are kept simple by design. However, our measurements show that we can compete with sched_rt on large multi-CPUs machines [1]. Only the push path is addressed, the extension to use this structure also for pull decisions is straightforward. However, we are currently evaluating different (in order to decrease/avoid contention) data structures to solve possibly both problems. We are also going to re-run tests considering recent changes inside cpupri [2]. [1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf [2] http://www.spinics.net/lists/linux-rt-users/msg06778.html Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-14-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: Add bandwidth management for SCHED_DEADLINE tasksDario Faggioli
In order of deadline scheduling to be effective and useful, it is important that some method of having the allocation of the available CPU bandwidth to tasks and task groups under control. This is usually called "admission control" and if it is not performed at all, no guarantee can be given on the actual scheduling of the -deadline tasks. Since when RT-throttling has been introduced each task group have a bandwidth associated to itself, calculated as a certain amount of runtime over a period. Moreover, to make it possible to manipulate such bandwidth, readable/writable controls have been added to both procfs (for system wide settings) and cgroupfs (for per-group settings). Therefore, the same interface is being used for controlling the bandwidth distrubution to -deadline tasks and task groups, i.e., new controls but with similar names, equivalent meaning and with the same usage paradigm are added. However, more discussion is needed in order to figure out how we want to manage SCHED_DEADLINE bandwidth at the task group level. Therefore, this patch adds a less sophisticated, but actually very sensible, mechanism to ensure that a certain utilization cap is not overcome per each root_domain (the single rq for !SMP configurations). Another main difference between deadline bandwidth management and RT-throttling is that -deadline tasks have bandwidth on their own (while -rt ones doesn't!), and thus we don't need an higher level throttling mechanism to enforce the desired bandwidth. This patch, therefore: - adds system wide deadline bandwidth management by means of: * /proc/sys/kernel/sched_dl_runtime_us, * /proc/sys/kernel/sched_dl_period_us, that determine (i.e., runtime / period) the total bandwidth available on each CPU of each root_domain for -deadline tasks; - couples the RT and deadline bandwidth management, i.e., enforces that the sum of how much bandwidth is being devoted to -rt -deadline tasks to stay below 100%. This means that, for a root_domain comprising M CPUs, -deadline tasks can be created until the sum of their bandwidths stay below: M * (sched_dl_runtime_us / sched_dl_period_us) It is also possible to disable this bandwidth management logic, and be thus free of oversubscribing the system up to any arbitrary level. Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: Add SCHED_DEADLINE inheritance logicDario Faggioli
Some method to deal with rt-mutexes and make sched_dl interact with the current PI-coded is needed, raising all but trivial issues, that needs (according to us) to be solved with some restructuring of the pi-code (i.e., going toward a proxy execution-ish implementation). This is under development, in the meanwhile, as a temporary solution, what this commits does is: - ensure a pi-lock owner with waiters is never throttled down. Instead, when it runs out of runtime, it immediately gets replenished and it's deadline is postponed; - the scheduling parameters (relative deadline and default runtime) used for that replenishments --during the whole period it holds the pi-lock-- are the ones of the waiting task with earliest deadline. Acting this way, we provide some kind of boosting to the lock-owner, still by using the existing (actually, slightly modified by the previous commit) pi-architecture. We would stress the fact that this is only a surely needed, all but clean solution to the problem. In the end it's only a way to re-start discussion within the community. So, as always, comments, ideas, rants, etc.. are welcome! :-) Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> [ Added !RT_MUTEXES build fix. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13rtmutex: Turn the plist into an rb-treePeter Zijlstra
Turn the pi-chains from plist to rb-tree, in the rt_mutex code, and provide a proper comparison function for -deadline and -priority tasks. This is done mainly because: - classical prio field of the plist is just an int, which might not be enough for representing a deadline; - manipulating such a list would become O(nr_deadline_tasks), which might be to much, as the number of -deadline task increases. Therefore, an rb-tree is used, and tasks are queued in it according to the following logic: - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the one with the higher (lower, actually!) prio wins; - among a -priority and a -deadline task, the latter always wins; - among two -deadline tasks, the one with the earliest deadline wins. Queueing and dequeueing functions are changed accordingly, for both the list of a task's pi-waiters and the list of tasks blocked on a pi-lock. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-again-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: Add latency tracing for SCHED_DEADLINE tasksDario Faggioli
It is very likely that systems that wants/needs to use the new SCHED_DEADLINE policy also want to have the scheduling latency of the -deadline tasks under control. For this reason a new version of the scheduling wakeup latency, called "wakeup_dl", is introduced. As a consequence of applying this patch there will be three wakeup latency tracer: * "wakeup", that deals with all tasks in the system; * "wakeup_rt", that deals with -rt and -deadline tasks only; * "wakeup_dl", that deals with -deadline tasks only. Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-9-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: Add period support for SCHED_DEADLINE tasksHarald Gustafsson
Make it possible to specify a period (different or equal than deadline) for -deadline tasks. Relative deadlines (D_i) are used on task arrivals to generate new scheduling (absolute) deadlines as "d = t + D_i", and periods (P_i) to postpone the scheduling deadlines as "d = d + P_i" when the budget is zero. This is in general useful to model (and schedule) tasks that have slow activation rates (long periods), but have to be scheduled soon once activated (short deadlines). Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com> Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-7-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: Add SCHED_DEADLINE avg_update accountingDario Faggioli
Make the core scheduler and load balancer aware of the load produced by -deadline tasks, by updating the moving average like for sched_rt. Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-6-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logicJuri Lelli
Introduces data structures relevant for implementing dynamic migration of -deadline tasks and the logic for checking if runqueues are overloaded with -deadline tasks and for choosing where a task should migrate, when it is the case. Adds also dynamic migrations to SCHED_DEADLINE, so that tasks can be moved among CPUs when necessary. It is also possible to bind a task to a (set of) CPU(s), thus restricting its capability of migrating, or forbidding migrations at all. The very same approach used in sched_rt is utilised: - -deadline tasks are kept into CPU-specific runqueues, - -deadline tasks are migrated among runqueues to achieve the following: * on an M-CPU system the M earliest deadline ready tasks are always running; * affinity/cpusets settings of all the -deadline tasks is always respected. Therefore, this very special form of "load balancing" is done with an active method, i.e., the scheduler pushes or pulls tasks between runqueues when they are woken up and/or (de)scheduled. IOW, every time a preemption occurs, the descheduled task might be sent to some other CPU (depending on its deadline) to continue executing (push). On the other hand, every time a CPU becomes idle, it might pull the second earliest deadline ready task from some other CPU. To enforce this, a pull operation is always attempted before taking any scheduling decision (pre_schedule()), as well as a push one after each scheduling decision (post_schedule()). In addition, when a task arrives or wakes up, the best CPU where to resume it is selected taking into account its affinity mask, the system topology, but also its deadline. E.g., from the scheduling point of view, the best CPU where to wake up (and also where to push) a task is the one which is running the task with the latest deadline among the M executing ones. In order to facilitate these decisions, per-runqueue "caching" of the deadlines of the currently running and of the first ready task is used. Queued but not running tasks are also parked in another rb-tree to speed-up pushes. Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched/deadline: Add SCHED_DEADLINE structures & implementationDario Faggioli
Introduces the data structures, constants and symbols needed for SCHED_DEADLINE implementation. Core data structure of SCHED_DEADLINE are defined, along with their initializers. Hooks for checking if a task belong to the new policy are also added where they are needed. Adds a scheduling class, in sched/dl.c and a new policy called SCHED_DEADLINE. It is an implementation of the Earliest Deadline First (EDF) scheduling algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS) that makes it possible to isolate the behaviour of tasks between each other. The typical -deadline task will be made up of a computation phase (instance) which is activated on a periodic or sporadic fashion. The expected (maximum) duration of such computation is called the task's runtime; the time interval by which each instance need to be completed is called the task's relative deadline. The task's absolute deadline is dynamically calculated as the time instant a task (better, an instance) activates plus the relative deadline. The EDF algorithms selects the task with the smallest absolute deadline as the one to be executed first, while the CBS ensures each task to run for at most its runtime every (relative) deadline length time interval, avoiding any interference between different tasks (bandwidth isolation). Thanks to this feature, also tasks that do not strictly comply with the computational model sketched above can effectively use the new policy. To summarize, this patch: - introduces the data structures, constants and symbols needed; - implements the core logic of the scheduling algorithm in the new scheduling class file; - provides all the glue code between the new scheduling class and the core scheduler and refines the interactions between sched/dl and the other existing scheduling classes. Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com> Signed-off-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13sched: Add new scheduler syscalls to support an extended scheduling ↵Dario Faggioli
parameters ABI Add the syscalls needed for supporting scheduling algorithms with extended scheduling parameters (e.g., SCHED_DEADLINE). In general, it makes possible to specify a periodic/sporadic task, that executes for a given amount of runtime at each instance, and is scheduled according to the urgency of their own timing constraints, i.e.: - a (maximum/typical) instance execution time, - a minimum interval between consecutive instances, - a time constraint by which each instance must be completed. Thus, both the data structure that holds the scheduling parameters of the tasks and the system calls dealing with it must be extended. Unfortunately, modifying the existing struct sched_param would break the ABI and result in potentially serious compatibility issues with legacy binaries. For these reasons, this patch: - defines the new struct sched_attr, containing all the fields that are necessary for specifying a task in the computational model described above; - defines and implements the new scheduling related syscalls that manipulate it, i.e., sched_setattr() and sched_getattr(). Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a proof of concept and for developing and testing purposes. Making them available on other architectures is straightforward. Since no "user" for these new parameters is introduced in this patch, the implementation of the new system calls is just identical to their already existing counterpart. Future patches that implement scheduling policies able to exploit the new data structure must also take care of modifying the sched_*attr() calls accordingly with their own purposes. Signed-off-by: Dario Faggioli <raistlin@linux.it> [ Rewrote to use sched_attr. ] Signed-off-by: Juri Lelli <juri.lelli@gmail.com> [ Removed sched_setscheduler2() for now. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13Merge branch 'sched/urgent' into sched/coreIngo Molnar
Pick up the latest fixes before applying new changes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13futexes: Avoid taking the hb->lock if there's nothing to wake upDavidlohr Bueso
In futex_wake() there is clearly no point in taking the hb->lock if we know beforehand that there are no tasks to be woken. While the hash bucket's plist head is a cheap way of knowing this, we cannot rely 100% on it as there is a racy window between the futex_wait call and when the task is actually added to the plist. To this end, we couple it with the spinlock check as tasks trying to enter the critical region are most likely potential waiters that will be added to the plist, thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters. Furthermore, the futex ordering guarantees are preserved, ensuring that waiters either observe the changed user space value before blocking or is woken by a concurrent waker. For wakers, this is done by relying on the barriers in get_futex_key_refs() -- for archs that do not have implicit mb in atomic_inc(), we explicitly add them through a new futex_get_mm function. For waiters we rely on the fact that spin_lock calls already update the head counter, so spinners are visible even if the lock hasn't been acquired yet. For more details please refer to the updated comments in the code and related discussion: https://lkml.org/lkml/2013/11/26/556 Special thanks to tglx for careful review and feedback. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Darren Hart <dvhart@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Scott Norton <scott.norton@hp.com> Cc: Tom Vaden <tom.vaden@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Waiman Long <Waiman.Long@hp.com> Cc: Jason Low <jason.low2@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13futexes: Document multiprocessor ordering guaranteesThomas Gleixner
That's essential, if you want to hack on futexes. Reviewed-by: Darren Hart <dvhart@linux.intel.com> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Scott Norton <scott.norton@hp.com> Cc: Tom Vaden <tom.vaden@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Waiman Long <Waiman.Long@hp.com> Cc: Jason Low <jason.low2@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13futexes: Increase hash table size for better performanceDavidlohr Bueso
Currently, the futex global hash table suffers from its fixed, smallish (for today's standards) size of 256 entries, as well as its lack of NUMA awareness. Large systems, using many futexes, can be prone to high amounts of collisions; where these futexes hash to the same bucket and lead to extra contention on the same hb->lock. Furthermore, cacheline bouncing is a reality when we have multiple hb->locks residing on the same cacheline and different futexes hash to adjacent buckets. This patch keeps the current static size of 16 entries for small systems, or otherwise, 256 * ncpus (or larger as we need to round the number to a power of 2). Note that this number of CPUs accounts for all CPUs that can ever be available in the system, taking into consideration things like hotpluging. While we do impose extra overhead at bootup by making the hash table larger, this is a one time thing, and does not shadow the benefits of this patch. Furthermore, as suggested by tglx, by cache aligning the hash buckets we can avoid access across cacheline boundaries and also avoid massive cache line bouncing if multiple cpus are hammering away at different hash buckets which happen to reside in the same cache line. Also, similar to other core kernel components (pid, dcache, tcp), by using alloc_large_system_hash() we benefit from its NUMA awareness and thus the table is distributed among the nodes instead of in a single one. For a custom microbenchmark that pounds on the uaddr hashing -- making the wait path fail at futex_wait_setup() returning -EWOULDBLOCK for large amounts of futexes, we can see the following benefits on a 80-core, 8-socket 1Tb server: +---------+--------------------+------------------------+-----------------------+-------------------------------+ | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) | +---------+--------------------+------------------------+-----------------------+-------------------------------+ |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             | |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             | |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             | |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             | |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             | |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              | +---------+--------------------+------------------------+-----------------------+-------------------------------+ Reviewed-by: Darren Hart <dvhart@linux.intel.com> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Waiman Long <Waiman.Long@hp.com> Reviewed-and-tested-by: Jason Low <jason.low2@hp.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Scott Norton <scott.norton@hp.com> Cc: Tom Vaden <tom.vaden@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13futexes: Clean up various detailsJason Low
- Remove unnecessary head variables. - Delete unused parameter in queue_unlock(). Reviewed-by: Darren Hart <dvhart@linux.intel.com> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Scott Norton <scott.norton@hp.com> Cc: Tom Vaden <tom.vaden@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Waiman Long <Waiman.Long@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13Merge tag 'v3.13-rc8' into core/lockingIngo Molnar
Refresh the tree with the latest fixes, before applying new changes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13m68k/irq - Use polled IRQ flag for MFP timer cascaded interruptsMichael Schmitz
Some Atari hardware has no capacity to raise interrupts (e.g. network or USB adapter hardware attached via ROM port). The driver interrupt routine is called from a timer interrupt (timer D) in these cases, using chained device specific pseudo interrupts (IRQ_MFP_TIMER1 ff.) These interrupts will more often than not, return IRQ_NONE as there is not always work for the device handler when called. Too many unhandled interrupts will result in the interrupt being disabled by the stuck interrupt watchdog. As preferred option to flag interrupts as needing exclusion from the watchdog mechanism, tglx added the IRQ_IS_POLLED flag for use in such a case. Currently, two interrupts need to use this flag. Add more users as needed. Signed-off-by: Michael Schmitz <schmitz@debian.org> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2014-01-13Merge branch 'merge' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc fix from Ben Herrenschmidt: "Here's one regression fix for 3.13 that I would appreciate if you could still pull in. It was an "interesting" one to debug, basically it's an old bug that got somewhat "exposed" by new code breaking the boot on PA Semi boards (yes, it does appear that some people are still using these!)" * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: powerpc: Check return value of instance-to-package OF call
2014-01-13Merge branch 'x86/urgent' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Peter Anvin: "Sorry, meant to push out this batch earlier this weekend" * 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, fpu, amd: Clear exceptions in AMD FXSAVE workaround ftrace/x86: Load ftrace_ops in parameter not the variable holding it
2014-01-13powerpc: Check return value of instance-to-package OF callBenjamin Herrenschmidt
On PA-Semi firmware, the instance-to-package callback doesn't seem to be implemented. We didn't check for error, however, thus subsequently passed the -1 value returned into stdout_node to thins like prom_getprop etc... Thus caused the firmware to load values around 0 (physical) internally as node structures. It somewhat "worked" as long as we had a NULL in the right place (address 8) at the beginning of the kernel, we didn't "see" the bug. But commit 5c0484e25ec03243d4c2f2d4416d4a13efc77f6a "powerpc: Endian safe trampoline" changed the kernel entry point causing that old bug to now cause a crash early during boot. This fixes booting on PA-Semi board by properly checking the return value from instance-to-package. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Tested-by: Olof Johansson <olof@lixom.net> ---
2014-01-12Merge tag 'ras_for_3.14_p2' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras Pull RAS updates from Borislav Petkov: " SCI reporting for other error types not only correctable ones + APEI GHES cleanups + mce timer fix " Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-12Merge tag 'v3.13-rc8' into x86/ras, to pick up fixes.Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-12Merge tag 'perf-core-for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core Pull perf updates from Arnaldo Carvalho de Melo: User visible changes: Improvements: * Support showing source code, asking for variables to be collected at probe time and other 'perf probe' operations that use DWARF information. This supports only binaries with debugging information at this time, detached debuginfo (aka debuginfo packages) support should come in later patches. (Masami Hiramatsu) * Add a perf.data file header window in the 'perf report' TUI, associated with the 'i' hotkey, providing a counterpart to the --header option in the stdio UI. (Namhyung Kim) * Guest related improvements to 'perf kvm', including allowing to specify a directory with guest specific /proc information. (Dongsheng Yang) * Print session information only if --stdio is given (Namhyung Kim) Developer stuff: Fixes: * Get rid of a duplicate va_end() in error reporting (Namhyung Kim) * If a hist entry doesn't have symbol information, compare it with its address. Affects upcoming new feature (--cumulate) (Namhyung Kim) Improvements: * Make libtraceevent install target quieter (Jiri Olsa) * Make tests/make output more compact (Jiri Olsa) * Ignore generated files in feature-checks (Chunwei Chen) New APIs: * Introduce pevent_filter_strerror() in libtraceevent, similar in purpose to libc's strerror() function. (Namhyung Kim) Refactorings: * Use perf_data_file methods to write output file in 'record' and 'inject' (Jiri Olsa) * Use pr_*() functions where applicable in 'report' (Namhyumg Kim) * Add 'machine' 'addr_location' struct to have full picture (machine, thread, map, symbol, addr) for a (partially) resolved address, reducing function signatures (Arnaldo Carvalho de Melo) * Reduce code duplication in the histogram entry creation/insertion. (Arnaldo Carvalho de Melo) * Auto allocate annotation histogram data structures, (Arnaldo Carvalho de Melo) * No need to test against NULL before calling free, also set freed memory in struct pointers to NULL, to help fixing use after free bugs. (Arnaldo Carvalho de Melo> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-12um, x86: Fix vDSO buildRichard Weinberger
Commit 663b55b9b39 ("x86: Delete non-required instances of include <linux/init.h>") broke the UML build. arch/x86/um/vdso/vdso.S: Assembler messages: arch/x86/um/vdso/vdso.S:2: Error: no such instruction: `__initdata' arch/x86/um/vdso/vdso.S:9: Error: no such instruction: `__finit' UML's vDSO needs linux/init.h. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Richard Weinberger <richard@nod.at> Cc: user-mode-linux-devel@lists.sourceforge.net Cc: paul.gortmaker@windriver.com Link: http://lkml.kernel.org/r/1389538341-31383-1-git-send-email-richard@nod.at Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-12x86, mce: Fix mce_start_timer semanticsBorislav Petkov
So mce_start_timer() has a 'cpu' argument which is supposed to mean to start a timer on that cpu. However, the code currently starts a timer on the *current* cpu the function runs on and causes the sanity-check in mce_timer_fn to fire: WARNING: CPU: 0 PID: 0 at arch/x86/kernel/cpu/mcheck/mce.c:1286 mce_timer_fn because it is running on the wrong cpu. This was triggered by Prarit Bhargava <prarit@redhat.com> by offlining all the cpus in succession. Then, we were fiddling with the CMCI storm settings when starting the timer whereas there's no need for that - if there's storm happening on this newly restarted cpu, we're going to be in normal CMCI mode initially and then when the CMCI interrupt starts firing, we're going to go to the polling mode with the timer real soon. Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Prarit Bhargava <prarit@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Reviewed-by: Chen, Gong <gong.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1387722156-5511-1-git-send-email-prarit@redhat.com
2014-01-12ARM: 7938/1: OMAP4/highbank: Flush L2 cache before disablingTaras Kondratiuk
Kexec disables outer cache before jumping to reboot code, but it doesn't flush it explicitly. Flush is done implicitly inside of l2x0_disable(). But some SoC's override default .disable handler and don't flush cache. This may lead to a corrupted memory during Kexec reboot on these platforms. This patch adds cache flush inside of OMAP4 and Highbank outer_cache.disable() handlers to make it consistent with default l2x0_disable(). Acked-by: Rob Herring <rob.herring@calxeda.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Acked-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Taras Kondratiuk <taras.kondratiuk@linaro.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2014-01-12Merge branch 'fortglx/3.14/time' of ↵Ingo Molnar
git://git.linaro.org/people/john.stultz/linux into timers/core Pull timekeeping updates from John Stultz. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-12Merge branch 'linus' into timers/coreIngo Molnar
Pick up the latest fixes and refresh the branch. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-12x86/irq: Fix do_IRQ() interrupt warning for cpu hotplug retriggered irqsPrarit Bhargava
During heavy CPU-hotplug operations the following spurious kernel warnings can trigger: do_IRQ: No ... irq handler for vector (irq -1) [ See: https://bugzilla.kernel.org/show_bug.cgi?id=64831 ] When downing a cpu it is possible that there are unhandled irqs left in the APIC IRR register. The following code path shows how the problem can occur: 1. CPU 5 is to go down. 2. cpu_disable() on CPU 5 executes with interrupt flag cleared by local_irq_save() via stop_machine(). 3. IRQ 12 asserts on CPU 5, setting IRR but not ISR because interrupt flag is cleared (CPU unabled to handle the irq) 4. IRQs are migrated off of CPU 5, and the vectors' irqs are set to -1. 5. stop_machine() finishes cpu_disable() 6. cpu_die() for CPU 5 executes in normal context. 7. CPU 5 attempts to handle IRQ 12 because the IRR is set for IRQ 12. The code attempts to find the vector's IRQ and cannot because it has been set to -1. 8. do_IRQ() warning displays warning about CPU 5 IRQ 12. I added a debug printk to output which CPU & vector was retriggered and discovered that that we are getting bogus events. I see a 100% correlation between this debug printk in fixup_irqs() and the do_IRQ() warning. This patchset resolves this by adding definitions for VECTOR_UNDEFINED(-1) and VECTOR_RETRIGGERED(-2) and modifying the code to use them. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=64831 Signed-off-by: Prarit Bhargava <prarit@redhat.com> Reviewed-by: Rui Wang <rui.y.wang@intel.com> Cc: Michel Lespinasse <walken@google.com> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Yang Zhang <yang.z.zhang@Intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: janet.morgan@Intel.com Cc: tony.luck@Intel.com Cc: ruiv.wang@gmail.com Link: http://lkml.kernel.org/r/1388938252-16627-1-git-send-email-prarit@redhat.com [ Cleaned up the code a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-12Linux 3.13-rc8v3.13-rc8Linus Torvalds
2014-01-12SELinux: Fix possible NULL pointer dereference in selinux_inode_permission()Steven Rostedt
While running stress tests on adding and deleting ftrace instances I hit this bug: BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 IP: selinux_inode_permission+0x85/0x160 PGD 63681067 PUD 7ddbe067 PMD 0 Oops: 0000 [#1] PREEMPT CPU: 0 PID: 5634 Comm: ftrace-test-mki Not tainted 3.13.0-rc4-test-00033-gd2a6dde-dirty #20 Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006 task: ffff880078375800 ti: ffff88007ddb0000 task.ti: ffff88007ddb0000 RIP: 0010:[<ffffffff812d8bc5>] [<ffffffff812d8bc5>] selinux_inode_permission+0x85/0x160 RSP: 0018:ffff88007ddb1c48 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000800000 RCX: ffff88006dd43840 RDX: 0000000000000001 RSI: 0000000000000081 RDI: ffff88006ee46000 RBP: ffff88007ddb1c88 R08: 0000000000000000 R09: ffff88007ddb1c54 R10: 6e6576652f6f6f66 R11: 0000000000000003 R12: 0000000000000000 R13: 0000000000000081 R14: ffff88006ee46000 R15: 0000000000000000 FS: 00007f217b5b6700(0000) GS:ffffffff81e21000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M CR2: 0000000000000020 CR3: 000000006a0fe000 CR4: 00000000000007f0 Call Trace: security_inode_permission+0x1c/0x30 __inode_permission+0x41/0xa0 inode_permission+0x18/0x50 link_path_walk+0x66/0x920 path_openat+0xa6/0x6c0 do_filp_open+0x43/0xa0 do_sys_open+0x146/0x240 SyS_open+0x1e/0x20 system_call_fastpath+0x16/0x1b Code: 84 a1 00 00 00 81 e3 00 20 00 00 89 d8 83 c8 02 40 f6 c6 04 0f 45 d8 40 f6 c6 08 74 71 80 cf 02 49 8b 46 38 4c 8d 4d cc 45 31 c0 <0f> b7 50 20 8b 70 1c 48 8b 41 70 89 d9 8b 78 04 e8 36 cf ff ff RIP selinux_inode_permission+0x85/0x160 CR2: 0000000000000020 Investigating, I found that the inode->i_security was NULL, and the dereference of it caused the oops. in selinux_inode_permission(): isec = inode->i_security; rc = avc_has_perm_noaudit(sid, isec->sid, isec->sclass, perms, 0, &avd); Note, the crash came from stressing the deletion and reading of debugfs files. I was not able to recreate this via normal files. But I'm not sure they are safe. It may just be that the race window is much harder to hit. What seems to have happened (and what I have traced), is the file is being opened at the same time the file or directory is being deleted. As the dentry and inode locks are not held during the path walk, nor is the inodes ref counts being incremented, there is nothing saving these structures from being discarded except for an rcu_read_lock(). The rcu_read_lock() protects against freeing of the inode, but it does not protect freeing of the inode_security_struct. Now if the freeing of the i_security happens with a call_rcu(), and the i_security field of the inode is not changed (it gets freed as the inode gets freed) then there will be no issue here. (Linus Torvalds suggested not setting the field to NULL such that we do not need to check if it is NULL in the permission check). Note, this is a hack, but it fixes the problem at hand. A real fix is to restructure the destroy_inode() to call all the destructor handlers from the RCU callback. But that is a major job to do, and requires a lot of work. For now, we just band-aid this bug with this fix (it works), and work on a more maintainable solution in the future. Link: http://lkml.kernel.org/r/20140109101932.0508dec7@gandalf.local.home Link: http://lkml.kernel.org/r/20140109182756.17abaaa8@gandalf.local.home Cc: stable@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-12thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once onlyHugh Dickins
We see General Protection Fault on RSI in copy_page_rep: that RSI is what you get from a NULL struct page pointer. RIP: 0010:[<ffffffff81154955>] [<ffffffff81154955>] copy_page_rep+0x5/0x10 RSP: 0000:ffff880136e15c00 EFLAGS: 00010286 RAX: ffff880000000000 RBX: ffff880136e14000 RCX: 0000000000000200 RDX: 6db6db6db6db6db7 RSI: db73880000000000 RDI: ffff880dd0c00000 RBP: ffff880136e15c18 R08: 0000000000000200 R09: 000000000005987c R10: 000000000005987c R11: 0000000000000200 R12: 0000000000000001 R13: ffffea00305aa000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f195752f700(0000) GS:ffff880c7fc20000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000093010000 CR3: 00000001458e1000 CR4: 00000000000027e0 Call Trace: copy_user_huge_page+0x93/0xab do_huge_pmd_wp_page+0x710/0x815 handle_mm_fault+0x15d8/0x1d70 __do_page_fault+0x14d/0x840 do_page_fault+0x2f/0x90 page_fault+0x22/0x30 do_huge_pmd_wp_page() tests is_huge_zero_pmd(orig_pmd) four times: but since shrink_huge_zero_page() can free the huge_zero_page, and we have no hold of our own on it here (except where the fourth test holds page_table_lock and has checked pmd_same), it's possible for it to answer yes the first time, but no to the second or third test. Change all those last three to tests for NULL page. (Note: this is not the same issue as trinity's DEBUG_PAGEALLOC BUG in copy_page_rep with RSI: ffff88009c422000, reported by Sasha Levin in https://lkml.org/lkml/2013/3/29/103. I believe that one is due to the source page being split, and a tail page freed, while copy is in progress; and not a problem without DEBUG_PAGEALLOC, since the pmd_same check will prevent a miscopy from being made visible.) Fixes: 97ae17497e99 ("thp: implement refcounting for huge zero page") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org # v3.10 v3.11 v3.12 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-12arch: Introduce smp_load_acquire(), smp_store_release()Peter Zijlstra
A number of situations currently require the heavyweight smp_mb(), even though there is no need to order prior stores against later loads. Many architectures have much cheaper ways to handle these situations, but the Linux kernel currently has no portable way to make use of them. This commit therefore supplies smp_load_acquire() and smp_store_release() to remedy this situation. The new smp_load_acquire() primitive orders the specified load against any subsequent reads or writes, while the new smp_store_release() primitive orders the specifed store against any prior reads or writes. These primitives allow array-based circular FIFOs to be implemented without an smp_mb(), and also allow a theoretical hole in rcu_assign_pointer() to be closed at no additional expense on most architectures. In addition, the RCU experience transitioning from explicit smp_read_barrier_depends() and smp_wmb() to rcu_dereference() and rcu_assign_pointer(), respectively resulted in substantial improvements in readability. It therefore seems likely that replacing other explicit barriers with smp_load_acquire() and smp_store_release() will provide similar benefits. It appears that roughly half of the explicit barriers in core kernel code might be so replaced. [Changelog by PaulMck] Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Michael Ellerman <michael@ellerman.id.au> Cc: Michael Neuling <mikey@neuling.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Victor Kaplansky <VICTORK@il.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20131213150640.908486364@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-12arch: Clean up asm/barrier.h implementations using asm-generic/barrier.hPeter Zijlstra
We're going to be adding a few new barrier primitives, and in order to avoid endless duplication make more agressive use of asm-generic/barrier.h. Change the asm-generic/barrier.h such that it allows partial barrier definitions and fills out the rest with defaults. There are a few architectures (m32r, m68k) that could probably do away with their barrier.h file entirely but are kept for now due to their unconventional nop() implementation. Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Michael Ellerman <michael@ellerman.id.au> Cc: Michael Neuling <mikey@neuling.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Victor Kaplansky <VICTORK@il.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20131213150640.846368594@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-12arch: Move smp_mb__{before,after}_atomic_{inc,dec}.h into asm/atomic.hPeter Zijlstra
Move the barriers functions that depend on the atomic implementation into the atomic implementation. Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Vineet Gupta <vgupta@synopsys.com> [for arch/arc bits] Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20131213150640.786183683@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-12locking/doc: Rename LOCK/UNLOCK to ACQUIRE/RELEASEPeter Zijlstra
The LOCK and UNLOCK barriers as described in our barrier document are generally known as ACQUIRE and RELEASE barriers in other literature. Since we plan to introduce the acquire and release nomenclature in generic kernel primitives we should amend the document to avoid confusion as to what an acquire/release means. Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Michael Ellerman <michael@ellerman.id.au> Cc: Michael Neuling <mikey@neuling.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Victor Kaplansky <VICTORK@il.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20131217092435.GC21999@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-12block: null_blk: fix queue leak inside removing deviceMing Lei
When queue_mode is NULL_Q_MQ and null_blk is being removed, blk_cleanup_queue() isn't called to cleanup queue, so the queue allocated won't be freed. This patch calls blk_cleanup_queue() for MQ to drain all pending requests first and release the reference counter of queue kobject, then blk_mq_free_queue() will be called in queue kobject's release handler when queue kobject's reference counter drops to zero. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>