summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2014-08-20sched/fair: Fix reschedule which is generated on throttled cfs_rqKirill Tkhai
(sched_entity::on_rq == 1) does not guarantee the task is pickable; changes on throttled cfs_rq must not lead to reschedule. Check for task_struct::on_rq instead. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1407312361.8424.35.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-20sched: Match declaration with definitionPranith Kumar
Match the declaration of runqueues with the definition. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1407950893-32731-1-git-send-email-bobby.prani@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-20sched: Change autogroup_move_group() to use for_each_thread()Oleg Nesterov
Change autogroup_move_group() to use for_each_thread() instead of buggy while_each_thread(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Frank Mayhar <fmayhar@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sanjay Rao <srao@redhat.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140813192003.GA19334@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-20sched: Change thread_group_cputime() to use for_each_thread()Oleg Nesterov
Change thread_group_cputime() to use for_each_thread() instead of buggy while_each_thread(). This also makes the pid_alive() check unnecessary. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Frank Mayhar <fmayhar@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sanjay Rao <srao@redhat.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140813192000.GA19327@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-20sched: s/do_each_thread/for_each_process_thread/ in debug.cOleg Nesterov
Change kernel/sched/debug.c to use for_each_process_thread(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Frank Mayhar <fmayhar@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sanjay Rao <srao@redhat.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140813191956.GA19324@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-20sched: s/do_each_thread/for_each_process_thread/ in core.cOleg Nesterov
Change kernel/sched/core.c to use for_each_process_thread(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Frank Mayhar <fmayhar@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sanjay Rao <srao@redhat.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140813191953.GA19315@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-20perf: Handle compat ioctlPawel Moll
When running a 32-bit userspace on a 64-bit kernel (eg. i386 application on x86_64 kernel or 32-bit arm userspace on arm64 kernel) some of the perf ioctls must be treated with special care, as they have a pointer size encoded in the command. For example, PERF_EVENT_IOC_ID in 32-bit world will be encoded as 0x80042407, but 64-bit kernel will expect 0x80082407. In result the ioctl will fail returning -ENOTTY. This patch solves the problem by adding code fixing up the size as compat_ioctl file operation. Reported-by: Drew Richardson <drew.richardson@arm.com> Signed-off-by: Pawel Moll <pawel.moll@arm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/1402671812-9078-1-git-send-email-pawel.moll@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-18cgroup: reject cgroup names with '\n'Alban Crequy
/proc/<pid>/cgroup contains one cgroup path on each line. If cgroup names are allowed to contain "\n", applications cannot parse /proc/<pid>/cgroup safely. Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2014-08-18watchdog: Fix print-once on enableUlrich Obergfell
This patch avoids printing the message 'enabled on all CPUs, ...' multiple times. For example, the issue can occur in the following scenario: 1) watchdog_nmi_enable() fails to enable PMU counters and sets cpu0_err. 2) 'echo [0|1] > /proc/sys/kernel/nmi_watchdog' is executed to disable and re-enable the watchdog mechanism 'on the fly'. 3) If watchdog_nmi_enable() succeeds to enable PMU counters, each CPU will print the message because step1 left behind a non-zero cpu0_err. if (!IS_ERR(event)) { if (cpu == 0 || cpu0_err) pr_info("enabled on all CPUs, ...") The patch avoids this by clearing cpu0_err in watchdog_nmi_disable(). Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: pbonzini@redhat.com Link: http://lkml.kernel.org/r/1407768567-171794-4-git-send-email-dzickus@redhat.com [ Applied small cleanups. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-18watchdog: Remove unnecessary header fileschai wen
Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: pbonzini@redhat.com Link: http://lkml.kernel.org/r/1407768567-171794-2-git-send-email-dzickus@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-17mm: Support compiling out madvise and fadviseJosh Triplett
Many embedded systems will not need these syscalls, and omitting them saves space. Add a new EXPERT config option CONFIG_ADVISE_SYSCALLS (default y) to support compiling them out. bloat-o-meter: add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-2250 (-2250) function old new delta sys_fadvise64 57 - -57 sys_fadvise64_64 691 - -691 sys_madvise 1502 - -1502 Signed-off-by: Josh Triplett <josh@joshtriplett.org>
2014-08-16module: Clean up ro/nx after early module load failuresAndy Lutomirski
The commit 4982223e51e8 module: set nx before marking module MODULE_STATE_COMING. introduced a regression: if a module fails to parse its arguments or if mod_sysfs_setup fails, then the module's memory will be freed while still read-only. Anything that reuses that memory will crash as soon as it tries to write to it. Cc: stable@vger.kernel.org # v3.16 Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2014-08-14Merge tag 'pm+acpi-3.17-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more ACPI and power management updates from Rafael Wysocki: "These are a couple of regression fixes, cpuidle menu governor optimizations, fixes for ACPI proccessor and battery drivers, hibernation fix to avoid problems related to the e820 memory map, fixes for a few cpufreq drivers and a new version of the suspend profiling tool analyze_suspend.py. Specifics: - Fix for an ACPI-based device hotplug regression introduced in 3.14 that causes a kernel panic to trigger when memory hot-remove is attempted with CONFIG_ACPI_HOTPLUG_MEMORY unset from Tang Chen - Fix for a cpufreq regression introduced in 3.16 that triggers a "sleeping function called from invalid context" bug in dev_pm_opp_init_cpufreq_table() from Stephen Boyd - ACPI battery driver fix for a warning message added in 3.16 that prints silly stuff sometimes from Mariusz Ceier - Hibernation fix for safer handling of mismatches in the 820 memory map between the configurations during image creation and during the subsequent restore from Chun-Yi Lee - ACPI processor driver fix to handle CPU hotplug notifications correctly during system suspend/resume from Lan Tianyu - Series of four cpuidle menu governor cleanups that also should speed it up a bit from Mel Gorman - Fixes for the speedstep-smi, integrator, cpu0 and arm_big_little cpufreq drivers from Hans Wennborg, Himangi Saraogi, Markus Pargmann and Uwe Kleine-König - Version 3.0 of the analyze_suspend.py suspend profiling tool from Todd E Brandt" * tag 'pm+acpi-3.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / battery: Fix warning message in acpi_battery_get_state() PM / tools: analyze_suspend.py: update to v3.0 cpufreq: arm_big_little: fix module license spec cpufreq: speedstep-smi: fix decimal printf specifiers ACPI / hotplug: Check scan handlers in acpi_scan_hot_remove() cpufreq: OPP: Avoid sleeping while atomic cpufreq: cpu0: Do not print error message when deferring cpufreq: integrator: Use set_cpus_allowed_ptr PM / hibernate: avoid unsafe pages in e820 reserved regions ACPI / processor: Make acpi_cpu_soft_notify() process CPU FROZEN events cpuidle: menu: Lookup CPU runqueues less cpuidle: menu: Call nr_iowait_cpu less times cpuidle: menu: Use ktime_to_us instead of reinventing the wheel cpuidle: menu: Use shifts when calculating averages where possible
2014-08-14timekeeping: Another fix to the VSYSCALL_OLD update_vsyscallJohn Stultz
Benjamin Herrenschmidt pointed out that I further missed modifying update_vsyscall after the wall_to_mono value was changed to a timespec64. This causes issues on powerpc32, which expects a 32bit timespec. This patch fixes the problem by properly converting from a timespec64 to a timespec before passing the value on to the arch-specific vsyscall logic. [ Thomas is currently on vacation, but reviewed it and wanted me to send this fix on to you directly. ] Cc: LKML <linux-kernel@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-14Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull more powerpc updates from Ben Herrenschmidt: "Here are some more powerpc bits for 3.17, essentially fixes. The biggest series, also aimed at -stable, is from Aneesh and is the result of weeks and weeks of debugging to find out why the heck or THP implementation was occasionally triggering multi-hit errors in our level 1 TLB. It ended up being a combination of issues including subtleties as to how we should invalidate those special 'MPSS' pages we use to allow the use of 16M pages inside 4K/64K "base page size" segments (you really have to love our MMU !) Another interesting one in the "OMG" category is the series from Michael adding memory barriers to spin_is_locked(). That's also the result of many days of debugging to figure out why the semaphore code would occasionally crash in ways that made no sense. It ended up being some creative lock stacking that was defeated by the fact that our locks allow a load inside the locked section to be re-ordered with the load of the lock value itself (I'm still of two mind about whether to kill that once and for all by putting a heavier barrier back into our lock implementation...). The fixes come with a long explanation in the cset comments, feel free to read it if you feel like having a headache today" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (25 commits) powerpc/thp: Add tracepoints to track hugepage invalidate powerpc/mm: Use read barrier when creating real_pte powerpc/thp: Use ACCESS_ONCE when loading pmdp powerpc/thp: Invalidate with vpn in loop powerpc/thp: Handle combo pages in invalidate powerpc/thp: Invalidate old 64K based hash page mapping before insert of 4k pte powerpc/thp: Don't recompute vsid and ssize in loop on invalidate powerpc/thp: Add write barrier after updating the valid bit powerpc: reorder per-cpu NUMA information's initialization powerpc/perf/hv-24x7: Use kmem_cache_free powerpc/pseries/hvcserver: Fix endian issue in hvcs_get_partner_info powerpc: Hard disable interrupts in xmon powerpc: remove duplicate definition of TEXASR_FS powerpc/pseries: Avoid deadlock on removing ddw powerpc/pseries: Failure on removing device node powerpc/boot: Use correct zlib types for comparison powerpc/powernv: Interface to register/unregister opal dump region printk: Add function to return log buffer address and size powerpc: Add POWER8 features to CPU_FTRS_POSSIBLE/ALWAYS powerpc/ppc476: Disable BTAC ...
2014-08-14Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull seccomp fix from James Morris. BUG(!spin_is_locked()) really doesn't work very well in UP configurations without any actual spinlock state. Which is very much why we have that "assert_spin_lock()" function for this. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: seccomp: Replace BUG(!spin_is_locked()) with assert_spin_lock
2014-08-13locking/lockdep: Restrict the use of recursive read_lock() with qrwlockWaiman Long
Unlike the original unfair rwlock implementation, queued rwlock will grant lock according to the chronological sequence of the lock requests except when the lock requester is in the interrupt context. Consequently, recursive read_lock calls will now hang the process if there is a write_lock call somewhere in between the read_lock calls. This patch updates the lockdep implementation to look for recursive read_lock calls. A new read state (3) is used to mark those read_lock call that cannot be recursively called except in the interrupt context. The new read state does exhaust the 2 bits available in held_lock:read bit field. The addition of any new read state in the future may require a redesign of how all those bits are squeezed together in the held_lock structure. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Maarten Lankhorst <maarten.lankhorst@canonical.com> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1407345722-61615-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-13locking/Documentation: Move locking related docs into Documentation/locking/Davidlohr Bueso
Specifically: Documentation/locking/lockdep-design.txt Documentation/locking/lockstat.txt Documentation/locking/mutex-design.txt Documentation/locking/rt-mutex-design.txt Documentation/locking/rt-mutex.txt Documentation/locking/spinlocks.txt Documentation/locking/ww-mutex-design.txt Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: jason.low2@hp.com Cc: aswin@hp.com Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Mason <clm@fb.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Airlie <airlied@linux.ie> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jason Low <jason.low2@hp.com> Cc: Josef Bacik <jbacik@fusionio.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lubomir Rintel <lkundrak@v3.sk> Cc: Masanari Iida <standby24x7@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: fengguang.wu@intel.com Link: http://lkml.kernel.org/r/1406752916-3341-6-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-13locking/mutexes: Use MUTEX_SPIN_ON_OWNER when appropriateDavidlohr Bueso
4badad35 ("locking/mutex: Disable optimistic spinning on some architectures") added a ARCH_SUPPORTS_ATOMIC_RMW flag to disable the mutex optimistic feature on specific archs. Because CONFIG_MUTEX_SPIN_ON_OWNER only depended on DEBUG and SMP, it was ok to have the ->owner field conditional a bit flexible. However by adding a new variable to the matter, we can waste space with the unused field, ie: CONFIG_SMP && (!CONFIG_MUTEX_SPIN_ON_OWNER && !CONFIG_DEBUG_MUTEX). Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: aswin@hp.com Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jason Low <jason.low2@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1406752916-3341-5-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-13locking/mutexes: Refactor optimistic spinning codeDavidlohr Bueso
When we fail to acquire the mutex in the fastpath, we end up calling __mutex_lock_common(). A *lot* goes on in this function. Move out the optimistic spinning code into mutex_optimistic_spin() and simplify the former a bit. Furthermore, this is similar to what we have in rwsems. No logical changes. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: aswin@hp.com Cc: mingo@kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1406752916-3341-4-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-13locking/mcs: Remove obsolete commentDavidlohr Bueso
... as we clearly inline mcs_spin_lock() now. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: aswin@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1406752916-3341-3-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-13locking/mutexes: Document quick lock release when unlockingDavidlohr Bueso
When unlocking, we always want to reach the slowpath with the lock's counter indicating it is unlocked. -- as returned by the asm fastpath call or by explicitly setting it. While doing so, at least in theory, we can optimize and allow faster lock stealing. When unlocking, we always want to reach the slowpath with the lock's counter indicating it is unlocked. -- as returned by the asm fastpath call or by explicitly setting it. While doing so, at least in theory, we can optimize and allow faster lock stealing. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: jason.low2@hp.com Cc: aswin@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1406752916-3341-2-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-13locking/mutexes: Standardize arguments in lock/unlock slowpathsDavidlohr Bueso
Just how the locking-end behaves, when unlocking, go ahead and obtain the proper data structure immediately after the previous (asm-end) call exits and there are (probably) pending waiters. This simplifies a bit some of the layering. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: jason.low2@hp.com Cc: aswin@hp.com Cc: mingo@kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1406752916-3341-1-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-13locking: Remove deprecated smp_mb__() barriersPeter Zijlstra
Its been a while and there are no in-tree users left, so remove the deprecated barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chen, Gong <gong.chen@linux.intel.com> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Joe Perches <joe@perches.com> Cc: John Sullivan <jsrhbz@kanargh.force9.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-13perf: Do poll_wait() before checking condition in perf_poll()Sebastian Andrzej Siewior
One should first enqueue to the waitqueue and then check for the condition. If the condition gets true after mutex_unlock() but before poll_wait() then we lose it and would have wait for another wakeup. This has been like this since v2.6.31-rc1 commit c7138f37f9 ("perf_counter: fix perf_poll()"). Before that it was slightly worse. I guess we get enough wakeups so if we miss here one it doesn't really matter. It is still a bad example. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1407159068-1478-1-git-send-email-bigeasy@linutronix.de Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-13perf: Add queued work to remove orphaned child eventsJiri Olsa
In cases when the owner task exits before the workload and the workload made some forks, all the events stay in until the last workload process exits. Thats' because each child event holds parent reference. We want to release all children events once the parent is gone, because at that time there's no process to read them anyway, so they're just eating resources. This removal races with process exit, which removes all events and fork, which clone events. To be clear of those two, adding work queue to remove orphaned child for context in case such event is detected. Using delayed work queue (with delay == 1), because we queue this work under perf scheduler callbacks. Normal work queue tries to wake up the queue process, which deadlocks on rq->lock in this place. Also preventing clones from abandoned parent event. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1406896382-18404-4-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-13perf: Set owner pointer for kernel eventsJiri Olsa
Adding fake EVENT_OWNER_KERNEL owner pointer value for kernel perf events, so we could distinguish it from user events, which needs special care in following patch. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1406896382-18404-3-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-13printk: Add function to return log buffer address and sizeVasant Hegde
Platforms like IBM Power Systems supports service processor assisted dump. It provides interface to add memory region to be captured when system is crashed. During initialization/running we can add kernel memory region to be collected. Presently we don't have a way to get the log buffer base address and size. This patch adds support to return log buffer address and size. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2014-08-12sched/numa: Fix numa capacity computationRik van Riel
Commit c61037e9 fixes the phenomenon of 'fantom' cores due to N*frac(smt_power) >= 1 by limiting the capacity to the actual number of cores in the load balancing code. This patch applies the same correction to the NUMA balancing code. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: mgorman@suse.de Cc: vincent.guittot@linaro.org Cc: Morten.Rasmussen@arm.com Cc: nicolas.pitre@linaro.org Cc: efault@gmx.de Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1407173008-9334-3-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-12sched/numa: Fix off-by-one in capacity checkRik van Riel
Commit a43455a1d572daf7b730fe12eb747d1e17411365 ensures that task_numa_migrate will call task_numa_compare on the preferred node all the time, even when the preferred node has no free capacity. This could lead to a performance regression if nr_running == capacity on both the source and the destination node. This can be avoided by also checking for nr_running == capacity on the source node, which is one stricter than checking .has_free_capacity. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: mgorman@suse.de Cc: vincent.guittot@linaro.org Cc: Morten.Rasmussen@arm.com Cc: nicolas.pitre@linaro.org Cc: efault@gmx.de Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1407173008-9334-2-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-12sched: Rename a misleading variable in build_overlap_sched_groups()Zhihui Zhang
The child variable in build_overlap_sched_groups() actually refers to the peer or sibling domain of the given CPU. Rename it to sibling to be consistent with the naming in build_group_mask(). Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1406942283-18249-1-git-send-email-zzhsuny@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-12sched/fair: Allow calculate_imbalance() to move idle cpusPeter Zijlstra
Allow calculate_imbalance() to 'create' idle cpus in the busiest group if there are idle cpus in the local group. Suggested-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140729152705.GX12054@laptop.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-12sched/fair: Make update_sd_pick_busiest() return 'true' on a busier sdRik van Riel
Currently update_sd_pick_busiest only identifies the busiest sd that is either overloaded, or has a group imbalance. When no sd is imbalanced or overloaded, the load balancer fails to find the busiest domain. This breaks load balancing between domains that are not overloaded, in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest return true when the busiest sd yet is encountered. Groups are ranked in the order overloaded > imbalanced > other, with higher ranked groups getting priority even when their load is lower. This is necessary due to the possibility of unequal capacities and cpumasks between domains within a sched group. Behaviour for SD_ASYM_PACKING does not seem to match the comment, but I have no hardware to test that so I have left the behaviour of that code unchanged. Enum for group classification suggested by Peter Zijlstra. Signed-off-by: Rik van Riel <riel@redhat.com> [peterz: replaced sg_lb_stats::group_imb with the new enum group_type in an attempt to avoid endless recalculation] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Michael Neuling <mikey@neuling.org> Cc: ktkhai@parallels.com Cc: tim.c.chen@linux.intel.com Cc: nicolas.pitre@linaro.org Cc: jhladky@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140729152743.GI3935@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-12sched/fair: Make calculate_imbalance() independentPeter Zijlstra
Rik noticed that calculate_imbalance() relies on update_sd_pick_busiest() to guarantee that busiest->sum_nr_running > busiest->group_capacity_factor. Break this implicit assumption (with the intent of not providing it anymore) by having calculat_imbalance() verify it and not rely on others. Reported-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20140729152631.GW12054@laptop.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-11Merge branches 'pm-sleep', 'pm-cpufreq' and 'pm-cpuidle'Rafael J. Wysocki
* pm-sleep: PM / hibernate: avoid unsafe pages in e820 reserved regions * pm-cpufreq: cpufreq: arm_big_little: fix module license spec cpufreq: speedstep-smi: fix decimal printf specifiers cpufreq: OPP: Avoid sleeping while atomic cpufreq: cpu0: Do not print error message when deferring cpufreq: integrator: Use set_cpus_allowed_ptr * pm-cpuidle: cpuidle: menu: Lookup CPU runqueues less cpuidle: menu: Call nr_iowait_cpu less times cpuidle: menu: Use ktime_to_us instead of reinventing the wheel cpuidle: menu: Use shifts when calculating averages where possible
2014-08-11seccomp: Replace BUG(!spin_is_locked()) with assert_spin_lockGuenter Roeck
Current upstream kernel hangs with mips and powerpc targets in uniprocessor mode if SECCOMP is configured. Bisect points to commit dbd952127d11 ("seccomp: introduce writer locking"). Turns out that code such as BUG_ON(!spin_is_locked(&list_lock)); can not be used in uniprocessor mode because spin_is_locked() always returns false in this configuration, and that assert_spin_locked() exists for that very purpose and must be used instead. Fixes: dbd952127d11 ("seccomp: introduce writer locking") Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Kees Cook <keescook@chromium.org>
2014-08-11Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Stuff in here: - acct.c fixes and general rework of mnt_pin mechanism. That allows to go for delayed-mntput stuff, which will permit mntput() on deep stack without worrying about stack overflows - fs shutdown will happen on shallow stack. IOW, we can do Eric's umount-on-rmdir series without introducing tons of stack overflows on new mntput() call chains it introduces. - Bruce's d_splice_alias() patches - more Miklos' rename() stuff. - a couple of regression fixes (stable fodder, in the end of branch) and a fix for API idiocy in iov_iter.c. There definitely will be another pile, maybe even two. I'd like to get Eric's series in this time, but even if we miss it, it'll go right in the beginning of for-next in the next cycle - the tricky part of prereqs is in this pile" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits) fix copy_tree() regression __generic_file_write_iter(): fix handling of sync error after DIO switch iov_iter_get_pages() to passing maximal number of pages fs: mark __d_obtain_alias static dcache: d_splice_alias should detect loops exportfs: update Exporting documentation dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED dcache: remove unused d_find_alias parameter dcache: d_obtain_alias callers don't all want DISCONNECTED dcache: d_splice_alias should ignore DCACHE_DISCONNECTED dcache: d_splice_alias mustn't create directory aliases dcache: close d_move race in d_splice_alias dcache: move d_splice_alias namei: trivial fix to vfs_rename_dir comment VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode. cifs: support RENAME_NOREPLACE hostfs: support rename flags shmem: support RENAME_EXCHANGE shmem: support RENAME_NOREPLACE btrfs: add RENAME_NOREPLACE ...
2014-08-10Merge tag 'modules-next-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module updates from Rusty Russell: "This finally applies the stricter sysfs perms checking we pulled out before last merge window. A few stragglers are fixed (thanks linux-next!)" * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: arch/powerpc/platforms/powernv/opal-dump.c: fix world-writable sysfs files arch/powerpc/platforms/powernv/opal-elog.c: fix world-writable sysfs files drivers/video/fbdev/s3c2410fb.c: don't make debug world-writable. ARM: avoid ARM binutils leaking ELF local symbols scripts: modpost: Remove numeric suffix pattern matching scripts: modpost: fix compilation warning sysfs: disallow world-writable files. module: return bool from within_module*() module: add within_module() function modules: Fix build error in moduleloader.h
2014-08-09Merge tag 'trace-fixes-3.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull trace file read iterator fixes from Steven Rostedt: "This contains a fix for two long standing bugs. Both of which are rarely ever hit, and requires the user to do something that users rarely do. It took a few special test cases to even trigger this bug, and one of them was just one test in the process of finishing up as another one started. Both bugs have to do with the ring buffer iterator rb_iter_peek(), but one is more indirect than the other. The fist bug fix is simply an increase in the safety net loop counter. The counter makes sure that the rb_iter_peek() only iterates the number of times we expect it can, and no more. Well, there was one way it could iterate one more than we expected, and that caused the ring buffer to shutdown with a nasty warning. The fix was simply to up that counter by one. The other bug has to be with rb_iter_reset() (called by rb_iter_peek()). This happens when a user reads both the trace_pipe and trace files. The trace_pipe is a consuming read and does not use the ring buffer iterator, but the trace file is not a consuming read and does use the ring buffer iterator. When the trace file is being read, if it detects that a consuming read occurred, it resets the iterator and starts over. But the reset code that does this (rb_iter_reset()), checks if the reader_page is linked to the ring buffer or not, and will look into the ring buffer itself if it is not. This is wrong, as it should always try to read the reader page first. Not to mention, the code that looked into the ring buffer did it wrong, and used the header_page "read" offset to start reading on that page. That offset is bogus for pages in the writable ring buffer, and was corrupting the iterator, and it would start returning bogus events" * tag 'trace-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ring-buffer: Always reset iterator to reader page ring-buffer: Up rb_iter_peek() loop count to 3
2014-08-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "This is a bunch of small changes built against 3.16-rc6. The most significant change for users is the first patch which makes setns drmatically faster by removing unneded rcu handling. The next chunk of changes are so that "mount -o remount,.." will not allow the user namespace root to drop flags on a mount set by the system wide root. Aks this forces read-only mounts to stay read-only, no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec mounts to stay no exec and it prevents unprivileged users from messing with a mounts atime settings. I have included my test case as the last patch in this series so people performing backports can verify this change works correctly. The next change fixes a bug in NFS that was discovered while auditing nsproxy users for the first optimization. Today you can oops the kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever with pid namespaces. I rebased and fixed the build of the !CONFIG_NFS_FS case yesterday when a build bot caught my typo. Given that no one to my knowledge bases anything on my tree fixing the typo in place seems more responsible that requiring a typo-fix to be backported as well. The last change is a small semantic cleanup introducing /proc/thread-self and pointing /proc/mounts and /proc/net at it. This prevents several kinds of problemantic corner cases. It is a user-visible change so it has a minute chance of causing regressions so the change to /proc/mounts and /proc/net are individual one line commits that can be trivially reverted. Unfortunately I lost and could not find the email of the original reporter so he is not credited. From at least one perspective this change to /proc/net is a refgression fix to allow pthread /proc/net uses that were broken by the introduction of the network namespace" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net proc: Implement /proc/thread-self to point at the directory of the current thread proc: Have net show up under /proc/<tgid>/task/<tid> NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes mnt: Add tests for unprivileged remount cases that have found to be faulty mnt: Change the default remount atime from relatime to the existing value mnt: Correct permission checks in do_remount mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount mnt: Only change user settable mount flags in remount namespaces: Use task_lock and not rcu to protect nsproxy
2014-08-09Merge branch 'signal-cleanup' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rw/misc Pull arch signal handling cleanup from Richard Weinberger: "This patch series moves all remaining archs to the get_signal(), signal_setup_done() and sigsp() functions. Currently these archs use open coded variants of the said functions. Further, unused parameters get removed from get_signal_to_deliver(), tracehook_signal_handler() and signal_delivered(). At the end of the day we save around 500 lines of code." * 'signal-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/misc: (43 commits) powerpc: Use sigsp() openrisc: Use sigsp() mn10300: Use sigsp() mips: Use sigsp() microblaze: Use sigsp() metag: Use sigsp() m68k: Use sigsp() m32r: Use sigsp() hexagon: Use sigsp() frv: Use sigsp() cris: Use sigsp() c6x: Use sigsp() blackfin: Use sigsp() avr32: Use sigsp() arm64: Use sigsp() arc: Use sigsp() sas_ss_flags: Remove nested ternary if Rip out get_signal_to_deliver() Clean up signal_delivered() tracehook_signal_handler: Remove sig, info, ka and regs ...
2014-08-08x86: Add "make tinyconfig" to configure the tiniest possible kernelJosh Triplett
Since commit 5d2acfc7b974bbd3858b4dd3f2cdc6362dd8843a ("kconfig: make allnoconfig disable options behind EMBEDDED and EXPERT") in 3.15-rc1, "make allnoconfig" disables every possible config option. However, a few configuration options (CC_OPTIMIZE_FOR_SIZE, OPTIMIZE_INLINING) produce a smaller kernel when turned on, and a few choices exist (compression, highmem, allocator) for which a non-default option produces a smaller kernel. Add a "tinyconfig" option, which starts from allnoconfig and then sets these options to configure the tiniest possible kernel. This provides a better baseline for embedded systems or efforts to reduce kernel size. Signed-off-by: Josh Triplett <josh@joshtriplett.org>
2014-08-08kexec: verify the signature of signed PE bzImageVivek Goyal
This is the final piece of the puzzle of verifying kernel image signature during kexec_file_load() syscall. This patch calls into PE file routines to verify signature of bzImage. If signature are valid, kexec_file_load() succeeds otherwise it fails. Two new config options have been introduced. First one is CONFIG_KEXEC_VERIFY_SIG. This option enforces that kernel has to be validly signed otherwise kernel load will fail. If this option is not set, no signature verification will be done. Only exception will be when secureboot is enabled. In that case signature verification should be automatically enforced when secureboot is enabled. But that will happen when secureboot patches are merged. Second config option is CONFIG_KEXEC_BZIMAGE_VERIFY_SIG. This option enables signature verification support on bzImage. If this option is not set and previous one is set, kernel image loading will fail because kernel does not have support to verify signature of bzImage. I tested these patches with both "pesign" and "sbsign" signed bzImages. I used signing_key.priv key and signing_key.x509 cert for signing as generated during kernel build process (if module signing is enabled). Used following method to sign bzImage. pesign ====== - Convert DER format cert to PEM format cert openssl x509 -in signing_key.x509 -inform DER -out signing_key.x509.PEM -outform PEM - Generate a .p12 file from existing cert and private key file openssl pkcs12 -export -out kernel-key.p12 -inkey signing_key.priv -in signing_key.x509.PEM - Import .p12 file into pesign db pk12util -i /tmp/kernel-key.p12 -d /etc/pki/pesign - Sign bzImage pesign -i /boot/vmlinuz-3.16.0-rc3+ -o /boot/vmlinuz-3.16.0-rc3+.signed.pesign -c "Glacier signing key - Magrathea" -s sbsign ====== sbsign --key signing_key.priv --cert signing_key.x509.PEM --output /boot/vmlinuz-3.16.0-rc3+.signed.sbsign /boot/vmlinuz-3.16.0-rc3+ Patch details: Well all the hard work is done in previous patches. Now bzImage loader has just call into that code and verify whether bzImage signature are valid or not. Also create two config options. First one is CONFIG_KEXEC_VERIFY_SIG. This option enforces that kernel has to be validly signed otherwise kernel load will fail. If this option is not set, no signature verification will be done. Only exception will be when secureboot is enabled. In that case signature verification should be automatically enforced when secureboot is enabled. But that will happen when secureboot patches are merged. Second config option is CONFIG_KEXEC_BZIMAGE_VERIFY_SIG. This option enables signature verification support on bzImage. If this option is not set and previous one is set, kernel image loading will fail because kernel does not have support to verify signature of bzImage. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Dave Young <dyoung@redhat.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Matt Fleming <matt@console-pimps.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08kexec: support for kexec on panic using new system callVivek Goyal
This patch adds support for loading a kexec on panic (kdump) kernel usning new system call. It prepares ELF headers for memory areas to be dumped and for saved cpu registers. Also prepares the memory map for second kernel and limits its boot to reserved areas only. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Dave Young <dyoung@redhat.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08kexec-bzImage64: support for loading bzImage using 64bit entryVivek Goyal
This is loader specific code which can load bzImage and set it up for 64bit entry. This does not take care of 32bit entry or real mode entry. 32bit mode entry can be implemented if somebody needs it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Dave Young <dyoung@redhat.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08kexec: load and relocate purgatory at kernel load timeVivek Goyal
Load purgatory code in RAM and relocate it based on the location. Relocation code has been inspired by module relocation code and purgatory relocation code in kexec-tools. Also compute the checksums of loaded kexec segments and store them in purgatory. Arch independent code provides this functionality so that arch dependent bootloaders can make use of it. Helper functions are provided to get/set symbol values in purgatory which are used by bootloaders later to set things like stack and entry point of second kernel etc. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Dave Young <dyoung@redhat.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08kexec: implementation of new syscall kexec_file_loadVivek Goyal
Previous patch provided the interface definition and this patch prvides implementation of new syscall. Previously segment list was prepared in user space. Now user space just passes kernel fd, initrd fd and command line and kernel will create a segment list internally. This patch contains generic part of the code. Actual segment preparation and loading is done by arch and image specific loader. Which comes in next patch. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Dave Young <dyoung@redhat.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08kexec: new syscall kexec_file_load() declarationVivek Goyal
This is the new syscall kexec_file_load() declaration/interface. I have reserved the syscall number only for x86_64 so far. Other architectures (including i386) can reserve syscall number when they enable the support for this new syscall. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Borislav Petkov <bp@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Dave Young <dyoung@redhat.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08resource: provide new functions to walk through resourcesVivek Goyal
I have added two more functions to walk through resources. Currently walk_system_ram_range() deals with pfn and /proc/iomem can contain partial pages. By dealing in pfn, callback function loses the info that last page of a memory range is a partial page and not the full page. So I implemented walk_system_ram_res() which returns u64 values to callback functions and now it properly return start and end address. walk_system_ram_range() uses find_next_system_ram() to find the next ram resource. This in turn only travels through siblings of top level child and does not travers through all the nodes of the resoruce tree. I also need another function where I can walk through all the resources, for example figure out where "GART" aperture is. Figure out where ACPI memory is. So I wrote another function walk_iomem_res() which walks through all /proc/iomem resources and returns matches as asked by caller. Caller can specify "name" of resource, start and end and flags. Got rid of find_next_system_ram_res() and instead implemented more generic find_next_iomem_res() which can be used to traverse top level children only based on an argument. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Dave Young <dyoung@redhat.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08kexec: use common function for kimage_normal_alloc() and kimage_crash_alloc()Vivek Goyal
kimage_normal_alloc() and kimage_crash_alloc() are doing lot of similar things and differ only little. So instead of having two separate functions create a common function kimage_alloc_init() and pass it the "flags" argument which tells whether it is normal kexec or kexec_on_panic. And this function should be able to deal with both the cases. This consolidation also helps later where we can use a common function kimage_file_alloc_init() to handle normal and crash cases for new file based kexec syscall. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Dave Young <dyoung@redhat.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>