summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2014-09-07rcu: Add call_rcu_tasks()Paul E. McKenney
This commit adds a new RCU-tasks flavor of RCU, which provides call_rcu_tasks(). This RCU flavor's quiescent states are voluntary context switch (not preemption!) and userspace execution (not the idle loop -- use some sort of schedule_on_each_cpu() if you need to handle the idle tasks. Note that unlike other RCU flavors, these quiescent states occur in tasks, not necessarily CPUs. Includes fixes from Steven Rostedt. This RCU flavor is assumed to have very infrequent latency-tolerant updaters. This assumption permits significant simplifications, including a single global callback list protected by a single global lock, along with a single task-private linked list containing all tasks that have not yet passed through a quiescent state. If experience shows this assumption to be incorrect, the required additional complexity will be added. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcutorture: Add callback-flood testPaul E. McKenney
Although RCU is designed to handle arbitrary floods of callbacks, this capability is not routinely tested. This commit therefore adds a cbflood capability in which kthreads repeatedly registers large numbers of callbacks. One such kthread is created for each four CPUs (rounding up), and the test may be controlled by several cbflood_* kernel boot parameters, which control the number of bursts per flood, the number of callbacks per burst, the time between bursts, and the time between floods. The default values are large enough to exercise RCU's emergency responses to callback flooding. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Miller <davem@davemloft.net> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-09-07rcu: Use pr_alert/pr_cont for printing logsJoe Perches
User pr_alert/pr_cont for printing the logs from rcutorture module directly instead of writing it to a buffer and then printing it. This allows us from not having to allocate such buffers. Also remove a resulting empty function. I tested this using the parse-torture.sh script as follows: $ dmesg | grep torture > log.txt $ bash parse-torture.sh log.txt test $ There were no warnings which means that parsing went fine. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcutorture: Fix a sparse warning by marking boost_mutex staticPranith Kumar
This commit fixes the following sparse warning by marking boost_mutex static: kernel/rcu/rcutorture.c:185:1: warning: symbol 'boost_mutex' was not declared. Should it be static? Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-09-07rcu: Replace flush_signals() with WARN_ON(signal_pending())Paul E. McKenney
Currently, when RCU awakens from a wait_event_interruptible() that might have awakened prematurely, it does a flush_signals(). This is done on the off-chance that someone figured out how to deliver a signal to a kthread, which is supposed to be impossible. Given that this is supposed to be impossible, this commit changes the flush_signals() calls into WARN_ON(signal_pending()). Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Use rcu_gp_kthread_wake() to wake up grace period kthreadsPranith Kumar
The rcu_gp_kthread_wake() function checks for three conditions before waking up grace period kthreads: * Is the thread we are trying to wake up the current thread? * Are the gp_flags zero? (all threads wait on non-zero gp_flags condition) * Is there no thread created for this flavour, hence nothing to wake up? If any one of these condition is true, we do not call wake_up(). It was found that there are quite a few avoidable wake ups both during idle time and under stress induced by rcutorture. Idle: Total:66000, unnecessary:66000, case1:61827, case2:66000, case3:0 Total:68000, unnecessary:68000, case1:63696, case2:68000, case3:0 rcutorture: Total:254000, unnecessary:254000, case1:199913, case2:254000, case3:0 Total:256000, unnecessary:256000, case1:201784, case2:256000, case3:0 Here case{1-3} are the cases listed above. We can avoid these wake ups by using rcu_gp_kthread_wake() to conditionally wake up the grace period kthreads. There is a comment about an implied barrier supplied by the wake_up() logic. This barrier is necessary for the awakened thread to see the updated ->gp_flags. This flag is always being updated with the root node lock held. Also, the awakened thread tries to acquire the root node lock before reading ->gp_flags because of which there is proper ordering. Hence this commit tries to avoid calling wake_up() whenever we can by using rcu_gp_kthread_wake() function. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Make TINY_RCU tinier by putting error checks under #ifdefPaul E. McKenney
The rcu_idle_enter_common() and rcu_idle_exit_common() functions contain error checks that have to the best of my knowledge have never triggered over the past several years. These are nevertheless valuable when creating new architectures or doing other low-level changes, so the checks should not be deleted. This commit instead places these checks under #ifdef CONFIG_RCU_TRACE so that they are executed only when specifically requested. The savings are significant: Before: text data bss dec hex filename 1749 39 0 1788 6fc /tmp/b/kernel/rcu/tiny.o 632 152 0 784 310 /tmp/b/kernel/rcu/update.o ---- 2572 After: text data bss dec hex filename 1281 37 0 1318 526 /tmp/b/kernel/rcu/tiny.o 632 152 0 784 310 /tmp/b/kernel/rcu/update.o ---- 2102 This amounts to 470 bytes, or 18% of the original. Switched from #ifdef to IS_ENABLED() on Josh Triplett's advice. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-09-07rcu: Break more call_rcu() deadlock involving scheduler and perfPaul E. McKenney
Commit 96d3fd0d315a9 (rcu: Break call_rcu() deadlock involving scheduler and perf) covered the case where __call_rcu_nocb_enqueue() needs to wake the rcuo kthread due to the queue being initially empty, but did not do anything for the case where the queue was overflowing. This commit therefore also defers wakeup for the overflow case. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Remove stale comment in tree.cPranith Kumar
This commit removes a stale comment in rcu/tree.c which was left out when some code was moved around previously in commit 2036d94a7b61 ("rcu: Rework detection of use of RCU by offline CPUs") For reference, the following updated comment exists a few lines below this which means the same: /* Remove the outgoing CPU from the masks in the rcu_node hierarchy. */ Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Update tiny.c references to tree.cPranith Kumar
This commit updates the references to rcutree.c which is now rcu/tree.c Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Define tracepoint strings only if CONFIG_TRACING is setArd Biesheuvel
Commit f7f7bac9cb1c ("rcu: Have the RCU tracepoints use the tracepoint_string infrastructure") unconditionally populates the __tracepoint_str input section, but this section is not assigned an output section if CONFIG_TRACING is not set. This results in the __tracepoint_str turning up in unexpected places, i.e., after _edata. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Uninline rcu_read_lock_held()Oleg Nesterov
This commit uninlines rcu_read_lock_held(). According to "size vmlinux" this saves 28549 in .text: - 5541731 3014560 14757888 23314179 + 5513182 3026848 14757888 23297918 Note: it looks as if the data grows by 12288 bytes but this is not true, it does not actually grow. But .data starts with ALIGN(THREAD_SIZE) and since .text shrinks the padding grows, and thus .data grows too as it seen by /bin/size. diff System.map: - ffffffff81510000 D _sdata - ffffffff81510000 D init_thread_union + ffffffff81509000 D _sdata + ffffffff8150c000 D init_thread_union Perhaps we can change vmlinux.lds.S to .data itself, so that /bin/size can't "wrongly" report that .data grows if .text shinks. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Use true/false instead of 1/0 for a bool typePranith Kumar
This commit uses true/false instead of 1/0 for bool types in rcu_gp_fqs() and force_qs_rnp(). Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Return bool type for rcu_try_advance_all_cbs()Pranith Kumar
Return a bool type instead of 0 in rcu_try_advance_all_cbs(). Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Use bool type for return value in rcu_is_watching()Pranith Kumar
Use a bool type for return in rcu_is_watching(). Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Fix sparse warning about rcu_batches_completed_preempt() being non-staticPranith Kumar
fix sparse warning about rcu_batches_completed_preempt() being non-static by marking it as static Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Remove remaining read-modify-write ACCESS_ONCE() callsPranith Kumar
Change the remaining uses of ACCESS_ONCE() so that each ACCESS_ONCE() either does a load or a store, but not both. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07Merge tag 'pm+acpi-3.17-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management fixes from Rafael Wysocki: "These are regression fixes (ACPI sysfs, ACPI video, suspend test), ACPI cpuidle deadlock fix, missing runtime validation of ACPI _DSD output, a fix and a new CPU ID for the RAPL driver, new blacklist entry for the ACPI EC driver and a couple of trivial cleanups (intel_pstate and generic PM domains). Specifics: - Fix for recently broken test_suspend= command line argument (Rafael Wysocki). - Fixes for regressions related to the ACPI video driver caused by switching the default to native backlight handling in 3.16 from Hans de Goede. - Fix for a sysfs attribute of ACPI device objects that returns stale values sometimes due to the fact that they are cached instead of executing the appropriate method (_SUN) every time (broken in 3.14). From Yasuaki Ishimatsu. - Fix for a deadlock between cpuidle_lock and cpu_hotplug.lock in the ACPI processor driver from Jiri Kosina. - Runtime output validation for the ACPI _DSD device configuration object missing from the support for it that has been introduced recently. From Mika Westerberg. - Fix for an unuseful and misleading RAPL (Running Average Power Limit) domain detection message in the RAPL driver from Jacob Pan. - New Intel Haswell CPU ID for the RAPL driver from Jason Baron. - New Clevo W350etq blacklist entry for the ACPI EC driver from Lan Tianyu. - Cleanup for the intel_pstate driver and the core generic PM domains code from Gabriele Mazzotta and Geert Uytterhoeven" * tag 'pm+acpi-3.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / cpuidle: fix deadlock between cpuidle_lock and cpu_hotplug.lock ACPI / scan: not cache _SUN value in struct acpi_device_pnp cpufreq: intel_pstate: Remove unneeded variable powercap / RAPL: change domain detection message powercap / RAPL: add support for CPU model 0x3f PM / domains: Make generic_pm_domain.name const PM / sleep: Fix test_suspend= command line option ACPI / EC: Add msi quirk for Clevo W350etq ACPI / video: Disable native_backlight on HP ENVY 15 Notebook PC ACPI / video: Add a disable_native_backlight quirk ACPI / video: Fix use_native_backlight selection logic ACPICA: ACPI 5.1: Add support for runtime validation of _DSD package.
2014-09-07Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU fix from Ingo Molnar: "A boot hang fix for the offloaded callback RCU model (RCU_NOCB_CPU=y && (TREE_CPU=y || TREE_PREEMPT_RC)) in certain bootup scenarios" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rcu: Make nocb leader kthreads process pending callbacks after spawning
2014-09-07Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Three fixlets from the timer departement: - Update the timekeeper before updating vsyscall and pvclock. This fixes the kvm-clock regression reported by Chris and Paolo. - Use the proper irq work interface from NMI. This fixes the regression reported by Catalin and Dave. - Clarify the compat_nanosleep error handling mechanism to avoid future confusion" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Update timekeeper before updating vsyscall and pvclock compat: nanosleep: Clarify error handling nohz: Restore NMI safe local irq work for local nohz kick
2014-09-07sched/deadline: Fix a precision problem in the microseconds rangexiaofeng.yan
An overrun could happen in function start_hrtick_dl() when a task with SCHED_DEADLINE runs in the microseconds range. For example, if a task with SCHED_DEADLINE has the following parameters: Task runtime deadline period P1 200us 500us 500us The deadline and period from task P1 are less than 1ms. In order to achieve microsecond precision, we need to enable HRTICK feature by the next command: PC#echo "HRTICK" > /sys/kernel/debug/sched_features PC#trace-cmd record -e sched_switch & PC#./schedtool -E -t 200000:500000:500000 -e ./test The binary test is in an endless while(1) loop here. Some pieces of trace.dat are as follows: <idle>-0 157.603157: sched_switch: :R ==> 2481:4294967295: test test-2481 157.603203: sched_switch: 2481:R ==> 0:120: swapper/2 <idle>-0 157.605657: sched_switch: :R ==> 2481:4294967295: test test-2481 157.608183: sched_switch: 2481:R ==> 2483:120: trace-cmd trace-cmd-2483 157.609656: sched_switch:2483:R==>2481:4294967295: test We can get the runtime of P1 from the information above: runtime = 157.608183 - 157.605657 runtime = 0.002526(2.526ms) The correct runtime should be less than or equal to 200us at some point. The problem is caused by a conditional judgment "delta > 10000" in function start_hrtick_dl(). Because no hrtimer start up to control the rest of runtime when the reset of runtime is less than 10us. So the process will continue to run until tick-period is coming. Move the code with the limit of the least time slice from hrtick_start_fair() to hrtick_start() because the EDF schedule class also needs this function in start_hrtick_dl(). To fix this problem, we call hrtimer_start() unconditionally in start_hrtick_dl(), and make sure the scheduling slice won't be smaller than 10us in hrtimer_start(). Signed-off-by: Xiaofeng Yan <xiaofeng.yan@huawei.com> Reviewed-by: Li Zefan <lizefan@huawei.com> Acked-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409022941-5880-1-git-send-email-xiaofeng.yan@huawei.com [ Massaged the changelog and the code. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-06timekeeping: Update timekeeper before updating vsyscall and pvclockThomas Gleixner
The update_walltime() code works on the shadow timekeeper to make the seqcount protected region as short as possible. But that update to the shadow timekeeper does not update all timekeeper fields because it's sufficient to do that once before it becomes life. One of these fields is tkr.base_mono. That stays stale in the shadow timekeeper unless an operation happens which copies the real timekeeper to the shadow. The update function is called after the update calls to vsyscall and pvclock. While not correct, it did not cause any problems because none of the invoked update functions used base_mono. commit cbcf2dd3b3d4 (x86: kvm: Make kvm_get_time_and_clockread() nanoseconds based) changed that in the kvm pvclock update function, so the stale mono_base value got used and caused kvm-clock to malfunction. Put the update where it belongs and fix the issue. Reported-by: Chris J Arges <chris.j.arges@canonical.com> Reported-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Gleb Natapov <gleb@kernel.org> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409050000570.3333@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-09-06compat: nanosleep: Clarify error handlingThomas Gleixner
The error handling in compat_sys_nanosleep() is correct, but completely non obvious. Document it and restrict it to the -ERESTART_RESTARTBLOCK return value for clarity. Reported-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-09-05net: bpf: make eBPF interpreter images read-onlyDaniel Borkmann
With eBPF getting more extended and exposure to user space is on it's way, hardening the memory range the interpreter uses to steer its command flow seems appropriate. This patch moves the to be interpreted bytecode to read-only pages. In case we execute a corrupted BPF interpreter image for some reason e.g. caused by an attacker which got past a verifier stage, it would not only provide arbitrary read/write memory access but arbitrary function calls as well. After setting up the BPF interpreter image, its contents do not change until destruction time, thus we can setup the image on immutable made pages in order to mitigate modifications to that code. The idea is derived from commit 314beb9bcabf ("x86: bpf_jit_comp: secure bpf jit against spraying attacks"). This is possible because bpf_prog is not part of sk_filter anymore. After setup bpf_prog cannot be altered during its life-time. This prevents any modifications to the entire bpf_prog structure (incl. function/JIT image pointer). Every eBPF program (including classic BPF that are migrated) have to call bpf_prog_select_runtime() to select either interpreter or a JIT image as a last setup step, and they all are being freed via bpf_prog_free(), including non-JIT. Therefore, we can easily integrate this into the eBPF life-time, plus since we directly allocate a bpf_prog, we have no performance penalty. Tested with seccomp and test_bpf testsuite in JIT/non-JIT mode and manual inspection of kernel_page_tables. Brad Spengler proposed the same idea via Twitter during development of this patch. Joint work with Hannes Frederic Sowa. Suggested-by: Brad Spengler <spender@grsecurity.net> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Kees Cook <keescook@chromium.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05sched/fair: Replace rcu_assign_pointer() with RCU_INIT_POINTER()Andreea-Cristina Bernat
The use of "rcu_assign_pointer()" is NULLing out the pointer. According to RCU_INIT_POINTER()'s block comment: "1. This use of RCU_INIT_POINTER() is NULLing out the pointer" it is better to use it instead of rcu_assign_pointer() because it has a smaller overhead. The following Coccinelle semantic patch was used: @@ @@ - rcu_assign_pointer + RCU_INIT_POINTER (..., NULL) Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: paulmck@linux.vnet.ibm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140822145043.GA580@ada Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-04resources: Add device-managed request/release_resource()Thierry Reding
Provide device-managed implementations of the request_resource() and release_resource() functions. Upon failure to request a resource, the new devm_request_resource() function will output an error message for consistent error reporting. Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Tejun Heo <tj@kernel.org>
2014-09-04nohz: Restore NMI safe local irq work for local nohz kickFrederic Weisbecker
The local nohz kick is currently used by perf which needs it to be NMI-safe. Recent commit though (7d1311b93e58ed55f3a31cc8f94c4b8fe988a2b9) changed its implementation to fire the local kick using the remote kick API. It was convenient to make the code more generic but the remote kick isn't NMI-safe. As a result: WARNING: CPU: 3 PID: 18062 at kernel/irq_work.c:72 irq_work_queue_on+0x11e/0x140() CPU: 3 PID: 18062 Comm: trinity-subchil Not tainted 3.16.0+ #34 0000000000000009 00000000903774d1 ffff880244e06c00 ffffffff9a7f1e37 0000000000000000 ffff880244e06c38 ffffffff9a0791dd ffff880244fce180 0000000000000003 ffff880244e06d58 ffff880244e06ef8 0000000000000000 Call Trace: <NMI> [<ffffffff9a7f1e37>] dump_stack+0x4e/0x7a [<ffffffff9a0791dd>] warn_slowpath_common+0x7d/0xa0 [<ffffffff9a07930a>] warn_slowpath_null+0x1a/0x20 [<ffffffff9a17ca1e>] irq_work_queue_on+0x11e/0x140 [<ffffffff9a10a2c7>] tick_nohz_full_kick_cpu+0x57/0x90 [<ffffffff9a186cd5>] __perf_event_overflow+0x275/0x350 [<ffffffff9a184f80>] ? perf_event_task_disable+0xa0/0xa0 [<ffffffff9a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150 [<ffffffff9a187934>] perf_event_overflow+0x14/0x20 [<ffffffff9a020386>] intel_pmu_handle_irq+0x206/0x410 [<ffffffff9a0b54d3>] ? arch_vtime_task_switch+0x63/0x130 [<ffffffff9a01937b>] perf_event_nmi_handler+0x2b/0x50 [<ffffffff9a007b72>] nmi_handle+0xd2/0x390 [<ffffffff9a007aa5>] ? nmi_handle+0x5/0x390 [<ffffffff9a0d131b>] ? lock_release+0xab/0x330 [<ffffffff9a008062>] default_do_nmi+0x72/0x1c0 [<ffffffff9a0c925f>] ? cpuacct_account_field+0xcf/0x200 [<ffffffff9a008268>] do_nmi+0xb8/0x100 Lets fix this by restoring the use of local irq work for the nohz local kick. Reported-by: Catalin Iacob <iacobcatalin@gmail.com> Reported-and-tested-by: Dave Jones <davej@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-09-05cgroup: check cgroup liveliness before unbreaking kernfsLi Zefan
When cgroup_kn_lock_live() is called through some kernfs operation and another thread is calling cgroup_rmdir(), we'll trigger the warning in cgroup_get(). ------------[ cut here ]------------ WARNING: CPU: 1 PID: 1228 at kernel/cgroup.c:1034 cgroup_get+0x89/0xa0() ... Call Trace: [<c16ee73d>] dump_stack+0x41/0x52 [<c10468ef>] warn_slowpath_common+0x7f/0xa0 [<c104692d>] warn_slowpath_null+0x1d/0x20 [<c10bb999>] cgroup_get+0x89/0xa0 [<c10bbe58>] cgroup_kn_lock_live+0x28/0x70 [<c10be3c1>] __cgroup_procs_write.isra.26+0x51/0x230 [<c10be5b2>] cgroup_tasks_write+0x12/0x20 [<c10bb7b0>] cgroup_file_write+0x40/0x130 [<c11aee71>] kernfs_fop_write+0xd1/0x160 [<c1148e58>] vfs_write+0x98/0x1e0 [<c114934d>] SyS_write+0x4d/0xa0 [<c16f656b>] sysenter_do_call+0x12/0x12 ---[ end trace 6f2e0c38c2108a74 ]--- Fix this by calling css_tryget() instead of cgroup_get(). v2: - move cgroup_tryget() right below cgroup_get() definition. (Tejun) Cc: <stable@vger.kernel.org> # 3.15+ Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-05cgroup: delay the clearing of cgrp->kn->privLi Zefan
Run these two scripts concurrently: for ((; ;)) { mkdir /cgroup/sub rmdir /cgroup/sub } for ((; ;)) { echo $$ > /cgroup/sub/cgroup.procs echo $$ > /cgroup/cgroup.procs } A kernel bug will be triggered: BUG: unable to handle kernel NULL pointer dereference at 00000038 IP: [<c10bbd69>] cgroup_put+0x9/0x80 ... Call Trace: [<c10bbe19>] cgroup_kn_unlock+0x39/0x50 [<c10bbe91>] cgroup_kn_lock_live+0x61/0x70 [<c10be3c1>] __cgroup_procs_write.isra.26+0x51/0x230 [<c10be5b2>] cgroup_tasks_write+0x12/0x20 [<c10bb7b0>] cgroup_file_write+0x40/0x130 [<c11aee71>] kernfs_fop_write+0xd1/0x160 [<c1148e58>] vfs_write+0x98/0x1e0 [<c114934d>] SyS_write+0x4d/0xa0 [<c16f656b>] sysenter_do_call+0x12/0x12 We clear cgrp->kn->priv in the end of cgroup_rmdir(), but another concurrent thread can access kn->priv after the clearing. We should move the clearing to css_release_work_fn(). At that time no one is holding reference to the cgroup and no one can gain a new reference to access it. v2: - move RCU_INIT_POINTER() into the else block. (Tejun) - remove the cgroup_parent() check. (Tejun) - update the comment in css_tryget_online_from_dir(). Cc: <stable@vger.kernel.org> # 3.15+ Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-04locking/semaphore: Resolve some shadow warningsMark Rustad
Resolve some shadow warnings resulting from using the name jiffies, which is a well-known global. This is not a problem of course, but it could be a trap for someone copying and pasting code, and it just makes W=2 a little cleaner. Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1409739444-13635-1-git-send-email-jeffrey.t.kirsher@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-03seccomp: Allow arch code to provide seccomp_dataAndy Lutomirski
populate_seccomp_data is expensive: it works by inspecting task_pt_regs and various other bits to piece together all the information, and it's does so in multiple partially redundant steps. Arch-specific code in the syscall entry path can do much better. Admittedly this adds a bit of additional room for error, but the speedup should be worth it. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Kees Cook <keescook@chromium.org>
2014-09-03seccomp: Refactor the filter callback and the APIAndy Lutomirski
The reason I did this is to add a seccomp API that will be usable for an x86 fast path. The x86 entry code needs to use a rather expensive slow path for a syscall that might be visible to things like ptrace. By splitting seccomp into two phases, we can check whether we need the slow path and then use the fast path in if the filter allows the syscall or just returns some errno. As a side effect, I think the new code is much easier to understand than the old code. This has one user-visible effect: the audit record written for SECCOMP_RET_TRACE is now a simple indication that SECCOMP_RET_TRACE happened. It used to depend in a complicated way on what the tracer did. I couldn't make much sense of it. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Kees Cook <keescook@chromium.org>
2014-09-03seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computingAndy Lutomirski
The secure_computing function took a syscall number parameter, but it only paid any attention to that parameter if seccomp mode 1 was enabled. Rather than coming up with a kludge to get the parameter to work in mode 2, just remove the parameter. To avoid churn in arches that don't have seccomp filters (and may not even support syscall_get_nr right now), this leaves the parameter in secure_computing_strict, which is now a real function. For ARM, this is a bit ugly due to the fact that ARM conditionally supports seccomp filters. Fixing that would probably only be a couple of lines of code, but it should be coordinated with the audit maintainers. This will be a slight slowdown on some arches. The right fix is to pass in all of seccomp_data instead of trying to make just the syscall nr part be fast. This is a prerequisite for making two-phase seccomp work cleanly. Cc: Russell King <linux@arm.linux.org.uk> Cc: linux-arm-kernel@lists.infradead.org Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: x86@kernel.org Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Kees Cook <keescook@chromium.org>
2014-09-03genirq: Add irq_domain-aware core IRQ handlerMarc Zyngier
Calling irq_find_mapping from outside a irq_{enter,exit} section is unsafe and produces ugly messages if CONFIG_PROVE_RCU is enabled: If coming from the idle state, the rcu_read_lock call in irq_find_mapping will generate an unpleasant warning: <quote> =============================== [ INFO: suspicious RCU usage. ] 3.16.0-rc1+ #135 Not tainted ------------------------------- include/linux/rcupdate.h:871 rcu_read_lock() used illegally while idle! other info that might help us debug this: RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 0 RCU used illegally from extended quiescent state! 1 lock held by swapper/0/0: #0: (rcu_read_lock){......}, at: [<ffffffc00010206c>] irq_find_mapping+0x4c/0x198 </quote> As this issue is fairly widespread and involves at least three different architectures, a possible solution is to add a new handle_domain_irq entry point into the generic IRQ code that the interrupt controller code can call. This new function takes an irq_domain, and calls into irq_find_domain inside the irq_{enter,exit} block. An additional "lookup" parameter is used to allow non-domain architecture code to be replaced by this as well. Interrupt controllers can then be updated to use the new mechanism. This code is sitting behind a new CONFIG_HANDLE_DOMAIN_IRQ, as not all architectures implement set_irq_regs (yes, mn10300, I'm looking at you...). Reported-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/1409047421-27649-2-git-send-email-marc.zyngier@arm.com Signed-off-by: Jason Cooper <jason@lakedaemon.net>
2014-09-03Merge branch 'rcu/urgent' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent Pull an RCU fix from Paul E. McKenney: "This series contains a single commit fixing an initialization bug reported by Amit Shah and fixed by Pranith Kumar (and tested by Amit). This bug results in a boot-time hang in callback-offloaded configurations where callbacks were posted before the offloading ('rcuo') kthreads were created." Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-03PM / sleep: Fix test_suspend= command line optionRafael J. Wysocki
After commit d431cbc53cb7 (PM / sleep: Simplify sleep states sysfs interface code) the pm_states[] array is not populated initially, which causes setup_test_suspend() to always fail and the suspend testing during boot doesn't work any more. Fix the problem by using pm_labels[] instead of pm_states[] in setup_test_suspend() and storing a pointer to the label of the sleep state to test rather than the number representing it, because the connection between the state numbers and labels is only established by suspend_set_ops(). Fixes: d431cbc53cb7 (PM / sleep: Simplify sleep states sysfs interface code) Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-01Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq handling fixlet from Thomas Gleixner: "Just an export for an interrupt flow handler which is now used in gpio modules" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irq: Export handle_fasteoi_irq
2014-09-01genirq: Simplify wakeup mechanismThomas Gleixner
Currently we suspend wakeup interrupts by lazy disabling them and check later whether the interrupt has fired, but that's not sufficient for suspend to idle as there is no way to check that once we transitioned into the CPU idle state. So we change the mechanism in the following way: 1) Leave the wakeup interrupts enabled across suspend 2) Add a check to irq_may_run() which is called at the beginning of each flow handler whether the interrupt is an armed wakeup source. This check is basically free as it just extends the existing check for IRQD_IRQ_INPROGRESS. So no new conditional in the hot path. If the IRQD_WAKEUP_ARMED flag is set, then the interrupt is disabled, marked as pending/suspended and the pm core is notified about the wakeup event. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [ rjw: syscore.c and put irq_pm_check_wakeup() into pm.c ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-01genirq: Mark wakeup sources as armed on suspendThomas Gleixner
This allows us to utilize this information in the irq_may_run() check without adding another conditional to the fast path. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-01genirq: Create helper for flow handler entry checkThomas Gleixner
All flow handlers - except the per cpu ones - check for an interrupt in progress and an eventual concurrent polling on another cpu. Create a helper function for the repeated code pattern. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-01genirq: Distangle edge handler entryThomas Gleixner
If the interrupt is disabled or has no action, then we should not call the poll check. Separate the checks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-01genirq: Avoid double loop on suspendThomas Gleixner
We can synchronize the suspended interrupts right away. No need for an extra loop. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-01genirq: Move MASK_ON_SUSPEND handling into suspend_device_irqs()Thomas Gleixner
There is no reason why we should delay the masking of interrupts whose interrupt chip requests MASK_ON_SUSPEND to the point where we check the wakeup interrupts. We can do it right at the point where we mark the interrupt as suspended. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-01genirq: Make use of pm misfeature accountingThomas Gleixner
Use the accounting fields which got introduced for snity checking for the various PM options. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-01genirq: Add sanity checks for PM options on shared interrupt linesThomas Gleixner
Account the IRQF_NO_SUSPEND and IRQF_RESUME_EARLY actions on shared interrupt lines and yell loudly if there is a mismatch. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-01genirq: Move suspend/resume logic into irq/pm codeThomas Gleixner
No functional change. Preparatory patch for cleaning up the suspend abort functionality. Update the comments while at it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-01PM / sleep: Mechanism for aborting system suspends unconditionallyRafael J. Wysocki
It sometimes may be necessary to abort a system suspend in progress or wake up the system from suspend-to-idle even if the pm_wakeup_event()/pm_stay_awake() mechanism is not enabled. For this purpose, introduce a new global variable pm_abort_suspend and make pm_wakeup_pending() check its value. Also add routines for manipulating that variable. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-08-29kexec: create a new config option CONFIG_KEXEC_FILE for new syscallVivek Goyal
Currently new system call kexec_file_load() and all the associated code compiles if CONFIG_KEXEC=y. But new syscall also compiles purgatory code which currently uses gcc option -mcmodel=large. This option seems to be available only gcc 4.4 onwards. Hiding new functionality behind a new config option will not break existing users of old gcc. Those who wish to enable new functionality will require new gcc. Having said that, I am trying to figure out how can I move away from using -mcmodel=large but that can take a while. I think there are other advantages of introducing this new config option. As this option will be enabled only on x86_64, other arches don't have to compile generic kexec code which will never be used. This new code selects CRYPTO=y and CRYPTO_SHA256=y. And all other arches had to do this for CONFIG_KEXEC. Now with introduction of new config option, we can remove crypto dependency from other arches. Now CONFIG_KEXEC_FILE is available only on x86_64. So whereever I had CONFIG_X86_64 defined, I got rid of that. For CONFIG_KEXEC_FILE, instead of doing select CRYPTO=y, I changed it to "depends on CRYPTO=y". This should be safer as "select" is not recursive. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29resource: fix the case of null pointer accessVivek Goyal
Richard and Daniel reported that UML is broken due to changes to resource traversal functions. Problem is that iomem_resource.child can be null and new code does not consider that possibility. Old code used a for loop and that loop will not even execute if p was null. Revert back to for() loop logic and bail out if p is null. I also moved sibling_only check out of resource_lock. There is no reason to keep it inside the lock. Following is backtrace of the UML crash. RIP: 0033:[<0000000060039b9f>] RSP: 0000000081459da0 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 00000000219b3fff RCX: 000000006010d1d9 RDX: 0000000000000001 RSI: 00000000602dfb94 RDI: 0000000081459df8 RBP: 0000000081459de0 R08: 00000000601b59f4 R09: ffffffff0000ff00 R10: ffffffff0000ff00 R11: 0000000081459e88 R12: 0000000081459df8 R13: 00000000219b3fff R14: 00000000602dfb94 R15: 0000000000000000 Kernel panic - not syncing: Segfault with no mm CPU: 0 PID: 1 Comm: swapper Not tainted 3.16.0-10454-g58d08e3 #13 Stack: 00000000 000080d0 81459df0 219b3fff 81459e70 6010d1d9 ffffffff 6033e010 81459e50 6003a269 81459e30 00000000 Call Trace: [<6010d1d9>] ? kclist_add_private+0x0/0xe7 [<6003a269>] walk_system_ram_range+0x61/0xb7 [<6000e859>] ? proc_kcore_init+0x0/0xf1 [<6010d574>] kcore_update_ram+0x4c/0x168 [<6010d72e>] ? kclist_add+0x0/0x2e [<6000e943>] proc_kcore_init+0xea/0xf1 [<6000e859>] ? proc_kcore_init+0x0/0xf1 [<6000e859>] ? proc_kcore_init+0x0/0xf1 [<600189f0>] do_one_initcall+0x13c/0x204 [<6004ca46>] ? parse_args+0x1df/0x2e0 [<6004c82d>] ? parameq+0x0/0x3a [<601b5990>] ? strcpy+0x0/0x18 [<60001e1a>] kernel_init_freeable+0x240/0x31e [<6026f1c0>] kernel_init+0x12/0x148 [<60019fad>] new_thread_handler+0x81/0xa3 Fixes 8c86e70acead629aacb4a ("resource: provide new functions to walk through resources"). Reported-by: Daniel Walter <sahne@0x90.at> Tested-by: Richard Weinberger <richard@nod.at> Tested-by: Toralf Förster <toralf.foerster@gmx.de> Tested-by: Daniel Walter <sahne@0x90.at> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-28genirq: fix reference in devm_request_threaded_irq commentEmilio López
It should be request_threaded_irq, not request_irq [jkosina@suse.cz: not that it would matter, as both have the same set of arguments anyway, but for sake of consistency ...] Signed-off-by: Emilio López <emilio@elopez.com.ar> Signed-off-by: Jiri Kosina <jkosina@suse.cz>