summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2015-02-23audit: don't reset working wait time accidentally with auditdRichard Guy Briggs
During a queue overflow condition while we are waiting for auditd to drain the queue to make room for regular messages, we don't want a successful auditd that has bypassed the queue check to reset the backlog wait time. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-02-23audit: don't lose set wait time on first successful call to audit_log_start()Richard Guy Briggs
Copy the set wait time to a working value to avoid losing the set value if the queue overflows. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-02-23audit: move the tree pruning to a dedicated threadImre Palik
When file auditing is enabled, during a low memory situation, a memory allocation with __GFP_FS can lead to pruning the inode cache. Which can, in turn lead to audit_tree_freeing_mark() being called. This can call audit_schedule_prune(), that tries to fork a pruning thread, and waits until the thread is created. But forking needs memory, and the memory allocations there are done with __GFP_FS. So we are waiting merrily for some __GFP_FS memory allocations to complete, while holding some filesystem locks. This can take a while ... This patch creates a single thread for pruning the tree from audit_add_tree_rule(), and thus avoids the deadlock that the on-demand thread creation can cause. Reported-by: Matt Wilson <msw@amazon.com> Cc: Matt Wilson <msw@amazon.com> Signed-off-by: Imre Palik <imrep@amazon.de> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-02-22livepatch: RCU protect struct klp_func all the time when used in ↵Petr Mladek
klp_ftrace_handler() func->new_func has been accessed after rcu_read_unlock() in klp_ftrace_handler() and therefore the access was not protected. Signed-off-by: Petr Mladek <pmladek@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-02-21Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linusLinus Torvalds
Pull MIPS updates from Ralf Baechle: "This is the main pull request for MIPS: - a number of fixes that didn't make the 3.19 release. - a number of cleanups. - preliminary support for Cavium's Octeon 3 SOCs which feature up to 48 MIPS64 R3 cores with FPU and hardware virtualization. - support for MIPS R6 processors. Revision 6 of the MIPS architecture is a major revision of the MIPS architecture which does away with many of original sins of the architecture such as branch delay slots. This and other changes in R6 require major changes throughout the entire MIPS core architecture code and make up for the lion share of this pull request. - finally some preparatory work for eXtendend Physical Address support, which allows support of up to 40 bit of physical address space on 32 bit processors" [ Ahh, MIPS can't leave the PAE brain damage alone. It's like every CPU architect has to make that mistake, but pee in the snow by changing the TLA. But whether it's called PAE, LPAE or XPA, it's horrid crud - Linus ] * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (114 commits) MIPS: sead3: Corrected get_c0_perfcount_int MIPS: mm: Remove dead macro definitions MIPS: OCTEON: irq: add CIB and other fixes MIPS: OCTEON: Don't do acknowledge operations for level triggered irqs. MIPS: OCTEON: More OCTEONIII support MIPS: OCTEON: Remove setting of processor specific CVMCTL icache bits. MIPS: OCTEON: Core-15169 Workaround and general CVMSEG cleanup. MIPS: OCTEON: Update octeon-model.h code for new SoCs. MIPS: OCTEON: Implement DCache errata workaround for all CN6XXX MIPS: OCTEON: Add little-endian support to asm/octeon/octeon.h MIPS: OCTEON: Implement the core-16057 workaround MIPS: OCTEON: Delete unused COP2 saving code MIPS: OCTEON: Use correct instruction to read 64-bit COP0 register MIPS: OCTEON: Save and restore CP2 SHA3 state MIPS: OCTEON: Fix FP context save. MIPS: OCTEON: Save/Restore wider multiply registers in OCTEON III CPUs MIPS: boot: Provide more uImage options MIPS: Remove unneeded #ifdef __KERNEL__ from asm/processor.h MIPS: ip22-gio: Remove legacy suspend/resume support mips: pci: Add ifdef around pci_proc_domain ...
2015-02-21Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull ntp fix from Ingo Molnar: "An adjtimex interface regression fix for 32-bit systems" [ A check that was added in a previous commit is really only a concern for 64bit systems, but was applied to both 32 and 64bit systems, which results in breaking 32bit systems. Thus the fix here is to make the check only apply to 64bit systems ] * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: ntp: Fixup adjtimex freq validation on 32-bit systems
2015-02-21Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "Two fixes: the paravirt spin_unlock() corruption/crash fix, and an rtmutex NULL dereference crash fix" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/spinlocks/paravirt: Fix memory corruption on unlock locking/rtmutex: Avoid a NULL pointer dereference on deadlock
2015-02-21Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Thiscontains misc fixes: preempt_schedule_common() and io_schedule() recursion fixes, sched/dl fixes, a completion_done() revert, two sched/rt fixes and a comment update patch" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/rt: Avoid obvious configuration fail sched/autogroup: Fix failure to set cpu.rt_runtime_us sched/dl: Do update_rq_clock() in yield_task_dl() sched: Prevent recursion in io_schedule() sched/completion: Serialize completion_done() with complete() sched: Fix preempt_schedule_common() triggering tracing recursion sched/dl: Prevent enqueue of a sleeping task in dl_task_timer() sched: Make dl_task_time() use task_rq_lock() sched: Clarify ordering between task_rq_lock() and move_queued_task()
2015-02-21Merge branches 'core-urgent-for-linus' and 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull rcu fix and x86 irq fix from Ingo Molnar: - Fix a bug that caused an RCU warning splat. - Two x86 irq related fixes: a hotplug crash fix and an ACPI IRQ registry fix. * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rcu: Clear need_qs flag to prevent splat * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/irq: Check for valid irq descriptor in check_irq_vectors_for_cpu_disable() x86/irq: Fix regression caused by commit b568b8601f05
2015-02-20Merge tag 'for_linux-3.20-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb Pull kgdb/kdb updates from Jason Wessel: "KGDB/KDB New: - KDB: improved searching - No longer enter debug core on panic if panic timeout is set KGDB/KDB regressions / cleanups - fix pdf doc build errors - prevent junk characters on kdb console from printk levels" * tag 'for_linux-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb: kgdb, docs: Fix <para> pdfdocs build errors debug: prevent entering debug mode on panic/exception. kdb: Const qualifier for kdb_getstr's prompt argument kdb: Provide forward search at more prompt kdb: Fix a prompt management bug when using | grep kdb: Remove stack dump when entering kgdb due to NMI kdb: Avoid printing KERN_ levels to consoles kdb: Fix off by one error in kdb_cpu() kdb: fix incorrect counts in KDB summary command output
2015-02-19debug: prevent entering debug mode on panic/exception.Colin Cross
On non-developer devices, kgdb prevents the device from rebooting after a panic. Incase of panics and exceptions, to allow the device to reboot, prevent entering debug mode to avoid getting stuck waiting for the user to interact with debugger. To avoid entering the debugger on panic/exception without any extra configuration, panic_timeout is being used which can be set via /proc/sys/kernel/panic at run time and CONFIG_PANIC_TIMEOUT sets the default value. Setting panic_timeout indicates that the user requested machine to perform unattended reboot after panic. We dont want to get stuck waiting for the user input incase of panic. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: kgdb-bugreport@lists.sourceforge.net Cc: linux-kernel@vger.kernel.org Cc: Android Kernel Team <kernel-team@android.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Signed-off-by: Colin Cross <ccross@android.com> [Kiran: Added context to commit message. panic_timeout is used instead of break_on_panic and break_on_exception to honor CONFIG_PANIC_TIMEOUT Modified the commit as per community feedback] Signed-off-by: Kiran Raparthy <kiran.kumar@linaro.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2015-02-19kdb: Const qualifier for kdb_getstr's prompt argumentDaniel Thompson
All current callers of kdb_getstr() can pass constant pointers via the prompt argument. This patch adds a const qualification to make explicit the fact that this is safe. Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2015-02-19kdb: Provide forward search at more promptDaniel Thompson
Currently kdb allows the output of comamnds to be filtered using the | grep feature. This is useful but does not permit the output emitted shortly after a string match to be examined without wading through the entire unfiltered output of the command. Such a feature is particularly useful to navigate function traces because these traces often have a useful trigger string *before* the point of interest. This patch reuses the existing filtering logic to introduce a simple forward search to kdb that can be triggered from the more prompt. Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2015-02-19kdb: Fix a prompt management bug when using | grepDaniel Thompson
Currently when the "| grep" feature is used to filter the output of a command then the prompt is not displayed for the subsequent command. Likewise any characters typed by the user are also not echoed to the display. This rather disconcerting problem eventually corrects itself when the user presses Enter and the kdb_grepping_flag is cleared as kdb_parse() tries to make sense of whatever they typed. This patch resolves the problem by moving the clearing of this flag from the middle of command processing to the beginning. Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2015-02-19kdb: Remove stack dump when entering kgdb due to NMIDaniel Thompson
Issuing a stack dump feels ergonomically wrong when entering due to NMI. Entering due to NMI is normally a reaction to a user request, either the NMI button on a server or a "magic knock" on a UART. Therefore the backtrace behaviour on entry due to NMI should be like SysRq-g (no stack dump) rather than like oops. Note also that the stack dump does not offer any information that cannot be trivial retrieved using the 'bt' command. Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2015-02-19kdb: Avoid printing KERN_ levels to consolesDaniel Thompson
Currently when kdb traps printk messages then the raw log level prefix (consisting of '\001' followed by a numeral) does not get stripped off before the message is issued to the various I/O handlers supported by kdb. This causes annoying visual noise as well as causing problems grepping for ^. It is also a change of behaviour compared to normal usage of printk() usage. For example <SysRq>-h ends up with different output to that of kdb's "sr h". This patch addresses the problem by stripping log levels from messages before they are issued to the I/O handlers. printk() which can also act as an i/o handler in some cases is special cased; if the caller provided a log level then the prefix will be preserved when sent to printk(). The addition of non-printable characters to the output of kdb commands is a regression, albeit and extremely elderly one, introduced by commit 04d2c8c83d0e ("printk: convert the format for KERN_<LEVEL> to a 2 byte pattern"). Note also that this patch does *not* restore the original behaviour from v3.5. Instead it makes printk() from within a kdb command display the message without any prefix (i.e. like printk() normally does). Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Cc: Joe Perches <joe@perches.com> Cc: stable@vger.kernel.org Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2015-02-19kdb: Fix off by one error in kdb_cpu()Jason Wessel
There was a follow on replacement patch against the prior "kgdb: Timeout if secondary CPUs ignore the roundup". See: https://lkml.org/lkml/2015/1/7/442 This patch is the delta vs the patch that was committed upstream: * Fix an off-by-one error in kdb_cpu(). * Replace NR_CPUS with CONFIG_NR_CPUS to tell checkpatch that we really want a static limit. * Removed the "KGDB: " prefix from the pr_crit() in debug_core.c (kgdb-next contains a patch which introduced pr_fmt() to this file to the tag will now be applied automatically). Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2015-02-19kdb: fix incorrect counts in KDB summary command outputJay Lan
The output of KDB 'summary' command should report MemTotal, MemFree and Buffers output in kB. Current codes report in unit of pages. A define of K(x) as is defined in the code, but not used. This patch would apply the define to convert the values to kB. Please include me on Cc on replies. I do not subscribe to linux-kernel. Signed-off-by: Jay Lan <jlan@sgi.com> Cc: <stable@vger.kernel.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2015-02-19Merge branch 'kbuild' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kbuild updates from Michal Marek: - several cleanups in kbuild - serialize multiple *config targets so that 'make defconfig kvmconfig' works - The cc-ifversion macro got support for an else-branch * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: kbuild,gcov: simplify kernel/gcov/Makefile more kbuild: allow cc-ifversion to have the argument for false condition kbuild,gcov: simplify kernel/gcov/Makefile kbuild,gcov: remove unnecessary workaround kbuild: do not add $(call ...) to invoke cc-version or cc-fullversion kbuild: fix cc-ifversion macro kbuild: drop $(version_h) from MRPROPER_FILES kbuild: use mixed-targets when two or more config targets are given kbuild: remove redundant line from bounds.h/asm-offsets.h kbuild: merge bounds.h and asm-offsets.h rules kbuild: Drop support for clean-rule
2015-02-18livepatch: simplify disable error pathJosh Poimboeuf
If registering the function with ftrace has previously succeeded, unregistering will almost never fail. Even if it does, it's not a fatal error. We can still carry on and disable the klp_func from being used by removing it from the klp_ops func stack. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Petr Mladek <pmladek@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-02-18Merge tag 'perf-core-for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo: User visible changes: - No need to explicitely enable evsels for workload started from perf, let it be enabled via perf_event_attr.enable_on_exec, removing some events that take place in the 'perf trace' before a workload is really started by it. (Arnaldo Carvalho de Melo) - Fix to handle optimized not-inlined functions in 'perf probe' (Masami Hiramatsu) - Update 'perf probe' man page (Masami Hiramatsu) - 'perf trace': Allow mixing with tracepoints and suppressing plain syscalls (Arnaldo Carvalho de Melo) Infrastructure changes: - Introduce {trace_seq_do,event_format_}_fprintf functions to allow a default tracepoint field list printer to be used in tools that allows redirecting output to a file. (Arnaldo Carvalho de Melo) - The man page for pthread_attr_set_affinity_np states that _GNU_SOURCE must be defined before pthread.h, do it to fix the build in some systems (Josh Boyer) - Cleanups in 'perf buildid-cache' (Masami Hiramatsu) - Fix dso cache test case (Namhyung Kim) - Do Not rely on dso__data_read_offset() to open DSO (Namhyung Kim) - Make perf aware of tracefs (Steven Rostedt). - Fix build by defining STT_GNU_IFUNC for glibc 2.9 and older (Vinson Lee) - AArch64 symbol resolution fixes (Victor Kamensky) - Kconfig beachhead (Jiri Olsa) - Simplify nr_pages validity (Kaixu Xia) - Fixup header positioning in 'perf list' (Yunlong Song) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18sched/rt/nohz: Stop scheduler tick if running realtime taskRik van Riel
If the CPU is running a realtime task that does not round-robin with another realtime task of equal priority, there is no point in keeping the scheduler tick going. After all, whenever the scheduler tick runs, the kernel will just decide not to reschedule. Extend sched_can_stop_tick() to recognize these situations, and inform the rest of the kernel that the scheduler tick can be stopped. Tested-by: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: fweisbec@redhat.com Cc: mtosatti@redhat.com Link: http://lkml.kernel.org/r/20150216152349.6a8ed824@annuminas.surriel.com [ Small cleanliness tweak. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18Merge branch 'rcu/next' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent Pull RCU fix from Paul E. McKenney. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18perf: Simplify the branch stack checkYan, Zheng
Use event->attr.branch_sample_type to replace intel_pmu_needs_lbr_smpl() for avoiding duplicated code that implicitly enables the LBR. Currently, branch stack can be enabled by user explicitly requesting branch sampling or implicit branch sampling to correct PEBS skid. For user explicitly requested branch sampling, the branch_sample_type is explicitly set by user. For PEBS case, the branch_sample_type is also implicitly set to PERF_SAMPLE_BRANCH_ANY in x86_pmu_hw_config. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: eranian@google.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/1415156173-10035-11-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18perf: Always switch pmu specific data during context switchYan, Zheng
If two tasks were both forked from the same parent task, Events in their perf task contexts can be the same. Perf core may leave out switching the perf event contexts. Previous patch inroduces pmu specific data. The data is for saving the LBR stack, it is task specific. So we need to switch the data even when context switch is optimized out. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: eranian@google.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/1415156173-10035-7-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18perf: Add pmu specific data for perf task contextYan, Zheng
Introduce a new flag PERF_ATTACH_TASK_DATA for perf event's attach stata. The flag is set by PMU's event_init() callback, it indicates that perf event needs PMU specific data. The PMU specific data are initialized to zeros. Later patches will use PMU specific data to save LBR stack. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: eranian@google.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/1415156173-10035-6-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18perf/x86/intel: Use context switch callback to flush LBR stackYan, Zheng
Previous commit introduces context switch callback, its function overlaps with the flush branch stack callback. So we can use the context switch callback to flush LBR stack. This patch adds code that uses the flush branch callback to flush the LBR stack when task is being scheduled in. The callback is enabled only when there are events use the LBR hardware. This patch also removes all old flush branch stack code. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: eranian@google.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/1415156173-10035-4-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18perf: Introduce pmu context switch callbackYan, Zheng
The callback is invoked when process is scheduled in or out. It provides mechanism for later patches to save/store the LBR stack. For the schedule in case, the callback is invoked at the same place that flush branch stack callback is invoked. So it also can replace the flush branch stack callback. To avoid unnecessary overhead, the callback is enabled only when there are events use the LBR stack. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: eranian@google.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/1415156173-10035-3-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18perf: Update userspace page info for software eventShaohua Li
For hardware events, the userspace page of the event gets updated in context switches, so if we read the timestamp in the page, we get fresh info. For software events, this is missing currently. This patch makes the behavior consistent. With this patch, we can implement clock_gettime(THREAD_CPUTIME) with PERF_COUNT_SW_DUMMY in userspace as suggested by Andy and Peter. Code like this: if (pc->cap_user_time) { do { seq = pc->lock; barrier(); running = pc->time_running; cyc = rdtsc(); time_mult = pc->time_mult; time_shift = pc->time_shift; time_offset = pc->time_offset; barrier(); } while (pc->lock != seq); quot = (cyc >> time_shift); rem = cyc & ((1 << time_shift) - 1); delta = time_offset + quot * time_mult + ((rem * time_mult) >> time_shift); running += delta; return running; } I tried it on a busy system, the userspace page updating doesn't have noticeable overhead. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/aa2dd2e4f1e9f2225758be5ba00f14d6909a8ce1.1423180257.git.shli@fb.com [ Improved the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18perf: Update shadow timestamp before add eventShaohua Li
Update the shadow timestamp before start event, because .add might use the timestamp. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Link: http://lkml.kernel.org/r/9cd0276d6a047cb7c2885994f25e3a1f7c8c28af.1423180257.git.shli@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18locking/rwsem: Check for active lock before bailing on spinningDavidlohr Bueso
37e9562453b ("locking/rwsem: Allow conservative optimistic spinning when readers have lock") forced the default for optimistic spinning to be disabled if the lock owner was nil, which makes much sense for readers. However, while it is not our priority, we can make some optimizations for write-mostly workloads. We can bail the spinning step and still be conservative if there are any active tasks, otherwise there's really no reason not to spin, as the semaphore is most likely unlocked. This patch recovers most of a Unixbench 'execl' benchmark throughput by sleeping less and making better average system usage: before: CPU %user %nice %system %iowait %steal %idle all 0.60 0.00 8.02 0.00 0.00 91.38 after: CPU %user %nice %system %iowait %steal %idle all 1.22 0.00 70.18 0.00 0.00 28.60 Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Jason Low <jason.low2@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michel Lespinasse <walken@google.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1422609267-15102-6-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18locking/rwsem: Avoid deceiving lock spinnersDavidlohr Bueso
When readers hold the semaphore, the ->owner is nil. As such, and unlike mutexes, '!owner' does not necessarily imply that the lock is free. This will cause writers to potentially spin excessively as they've been mislead to thinking they have a chance of acquiring the lock, instead of blocking. This patch therefore enhances the counter check when the owner is not set by the time we've broken out of the loop. Otherwise we can return true as a new owner has the lock and thus we want to continue spinning. While at it, we can make rwsem_spin_on_owner() less ambiguos and return right away under need_resched conditions. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jason Low <jason.low2@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michel Lespinasse <walken@google.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1422609267-15102-5-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18locking/rwsem: Set lock ownership ASAPDavidlohr Bueso
In order to optimize the spinning step, we need to set the lock owner as soon as the lock is acquired; after a successful counter cmpxchg operation, that is. This is particularly useful as rwsems need to set the owner to nil for readers, so there is a greater chance of falling out of the spinning. Currently we only set the owner much later in the game, in the more generic level -- latency can be specially bad when waiting for a node->next pointer when releasing the osq in up_write calls. As such, update the owner inside rwsem_try_write_lock (when the lock is obtained after blocking) and rwsem_try_write_lock_unqueued (when the lock is obtained while spinning). This requires creating a new internal rwsem.h header to share the owner related calls. Also cleanup some headers for mutex and rwsem. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jason Low <jason.low2@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michel Lespinasse <walken@google.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1422609267-15102-4-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18locking/rwsem: Document barrier need when waking tasksDavidlohr Bueso
The need for the smp_mb() in __rwsem_do_wake() should be properly documented. Applies to both xadd and spinlock variants. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jason Low <jason.low2@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michel Lespinasse <walken@google.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1422609267-15102-3-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18locking/futex: Check PF_KTHREAD rather than !p->mm to filter out kthreadsOleg Nesterov
attach_to_pi_owner() checks p->mm to prevent attaching to kthreads and this looks doubly wrong: 1. It should actually check PF_KTHREAD, kthread can do use_mm(). 2. If this task is not kthread and it is actually the lock owner we can wrongly return -EPERM instead of -ESRCH or retry-if-EAGAIN. And note that this wrong EPERM is the likely case unless the exiting task is (auto)reaped quickly, we check ->mm before PF_EXITING. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Darren Hart <darren@dvhart.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mateusz Guzik <mguzik@redhat.com> Link: http://lkml.kernel.org/r/20150202140536.GA26406@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18locking/mutex: Refactor mutex_spin_on_owner()Jason Low
As suggested by Davidlohr, we could refactor mutex_spin_on_owner(). Currently, we split up owner_running() with mutex_spin_on_owner(). When the owner changes, we make duplicate owner checks which are not necessary. It also makes the code a bit obscure as we are using a second check to figure out why we broke out of the loop. This patch modifies it such that we remove the owner_running() function and the mutex_spin_on_owner() loop directly checks for if the owner changes, if the owner is not running, or if we need to reschedule. If the owner changes, we break out of the loop and return true. If the owner is not running or if we need to reschedule, then break out of the loop and return false. Suggested-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: chegu_vinod@hp.com Cc: tglx@linutronix.de Link: http://lkml.kernel.org/r/1422914367-5574-3-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18locking/mutex: In mutex_spin_on_owner(), return true when owner changesJason Low
In the mutex_spin_on_owner(), we return true only if lock->owner == NULL. This was beneficial in situations where there were multiple threads simultaneously spinning for the mutex. If another thread got the lock while other spinner(s) were also doing mutex_spin_on_owner(), then the other spinners would stop spinning. This workaround helped reduce the chance that many spinners were simultaneously spinning for the mutex which can help reduce contention in highly contended cases. However, recent changes were made to the optimistic spinning code such that instead of having all spinners simultaneously spin for the mutex, we queue the spinners with an MCS lock such that only one thread spins for the mutex at a time. Furthermore, the OSQ optimizations ensure that spinners in the queue will stop waiting if it needs to reschedule. Now, we don't have to worry about multiple threads spinning on owner at the same time, and if lock->owner is not NULL at this point, it likely means another thread happens to obtain the lock in the fastpath. In this case, it would make sense for the spinner to continue spinning as long as the spinner doesn't need to schedule and the mutex owner is running. This patch changes this so that mutex_spin_on_owner() returns true when the lock owner changes, which means a thread will only stop spinning if it either needs to reschedule or if the lock owner is not running. We saw up to a 5% performance improvement in the fserver workload with this patch. Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: chegu_vinod@hp.com Cc: tglx@linutronix.de Link: http://lkml.kernel.org/r/1422914367-5574-2-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18sched/numa: Avoid some pointless iterationsJan Beulich
Commit 81907478c431 ("sched/fair: Avoid using uninitialized variable in preferred_group_nid()") unconditionally initializes max_group with NODE_MASK_NONE, this means that when !max_faults (max_group didn't get set), we'll now continue the iteration with an empty mask. Which in turn makes the actual body of the loop go away, so we'll just iterate until completion; short circuit this by breaking out of the loop as soon as this would happen. Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20150209113727.GS5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18sched/numa: Do not move past the balance point if unbalancedRik van Riel
There is a subtle interaction between the logic introduced in commit e63da03639cc ("sched/numa: Allow task switch if load imbalance improves"), the way the load balancer counts the load on each NUMA node, and the way NUMA hinting faults are done. Specifically, the load balancer only counts currently running tasks in the load, while NUMA hinting faults may cause tasks to stop, if the page is locked by another task. This could cause all of the threads of a large single instance workload, like SPECjbb2005, to migrate to the same NUMA node. This was possible because occasionally they all fault on the same few pages, and only one of the threads remains runnable. That thread can move to the process's preferred NUMA node without making the imbalance worse, because nothing else is running at that time. The fix is to check the direction of the net moving of load, and to refuse a NUMA move if it would cause the system to move past the point of balance. In an unbalanced state, only moves that bring us closer to the balance point are allowed. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: mgorman@suse.de Link: http://lkml.kernel.org/r/20150203165648.0e9ac692@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18sched/rt: Avoid obvious configuration failPeter Zijlstra
Setting the root group's cpu.rt_runtime_us to 0 is a bad thing; it would disallow the kernel creating RT tasks. One can of course still set it to 1, which will (likely) still wreck your kernel, but at least make it clear that setting it to 0 is not good. Collect both sanity checks into the one place while we're there. Suggested-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20150209112715.GO24151@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18sched/autogroup: Fix failure to set cpu.rt_runtime_usPeter Zijlstra
Because task_group() uses a cache of autogroup_task_group(), whose output depends on sched_class, switching classes can generate problems. In particular, when started as fair, the cache points to the autogroup, so when switching to RT the tg_rt_schedulable() test fails for every cpu.rt_{runtime,period}_us change because now the autogroup has tasks and no runtime. Furthermore, going back to the previous semantics of varying task_group() with sched_class has the down-side that the sched_debug output varies as well, even though the task really is in the autogroup. Therefore add an autogroup exception to tg_has_rt_tasks() -- such that both (all) task_group() usages in sched/core now have one. And remove all the remnants of the variable task_group() output. Reported-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Stefan Bader <stefan.bader@canonical.com> Fixes: 8323f26ce342 ("sched: Fix race in task_group()") Link: http://lkml.kernel.org/r/20150209112237.GR5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18sched/dl: Do update_rq_clock() in yield_task_dl()Kirill Tkhai
update_curr_dl() needs actual rq clock. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1423040972.18770.10.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18clockevents: Introduce mode specific callbacksViresh Kumar
It is not possible for the clockevents core to know which modes (other than those with a corresponding feature flag) are supported by a particular implementation. And drivers are expected to handle transition to all modes elegantly, as ->set_mode() would be issued for them unconditionally. Now, adding support for a new mode complicates things a bit if we want to use the legacy ->set_mode() callback. We need to closely review all clockevents drivers to see if they would break on addition of a new mode. And after such reviews, it is found that we have to do non-trivial changes to most of the drivers [1]. Introduce mode-specific set_mode_*() callbacks, some of which the drivers may or may not implement. A missing callback would clearly convey the message that the corresponding mode isn't supported. A driver may still choose to keep supporting the legacy ->set_mode() callback, but ->set_mode() wouldn't be supporting any new modes beyond RESUME. If a driver wants to benefit from using a new mode, it would be required to migrate to the mode specific callbacks. The legacy ->set_mode() callback and the newly introduced mode-specific callbacks are mutually exclusive. Only one of them should be supported by the driver. Sanity check is done at the time of registration to distinguish between optional and required callbacks and to make error recovery and handling simpler. If the legacy ->set_mode() callback is provided, all mode specific ones would be ignored by the core but a warning is thrown if they are present. Call sites calling ->set_mode() directly are also updated to use __clockevents_set_mode() instead, as ->set_mode() may not be available anymore for few drivers. [1] https://lkml.org/lkml/2014/12/9/605 [2] https://lkml.org/lkml/2015/1/23/255 Suggested-by: Thomas Gleixner <tglx@linutronix.de> [2] Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: linaro-kernel@lists.linaro.org Cc: linaro-networking@linaro.org Link: http://lkml.kernel.org/r/792d59a40423f0acffc9bb0bec9de1341a06fa02.1423788565.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18genirq: Provide disable_hardirq()Peter Zijlstra
For things like netpoll there is a need to disable an interrupt from atomic context. Currently netpoll uses disable_irq() which will sleep-wait on threaded handlers and thus forced_irqthreads breaks things. Provide disable_hardirq(), which uses synchronize_hardirq() to only wait for active hardirq handlers; also change synchronize_hardirq() to return the status of threaded handlers. This will allow one to try-disable an interrupt from atomic context, or in case of request_threaded_irq() to only wait for the hardirq part. Suggested-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Miller <davem@davemloft.net> Cc: Eyal Perry <eyalpe@mellanox.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Quentin Lambert <lambert.quentin@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Russell King <linux@arm.linux.org.uk> Link: http://lkml.kernel.org/r/20150205130623.GH5029@twins.programming.kicks-ass.net [ Fixed typos and such. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18ntp: Fixup adjtimex freq validation on 32-bit systemsJohn Stultz
Additional validation of adjtimex freq values to avoid potential multiplication overflows were added in commit 5e5aeb4367b (time: adjtimex: Validate the ADJ_FREQUENCY values) Unfortunately the patch used LONG_MAX/MIN instead of LLONG_MAX/MIN, which was fine on 64-bit systems, but being much smaller on 32-bit systems caused false positives resulting in most direct frequency adjustments to fail w/ EINVAL. ntpd only does direct frequency adjustments at startup, so the issue was not as easily observed there, but other time sync applications like ptpd and chrony were more effected by the bug. See bugs: https://bugzilla.kernel.org/show_bug.cgi?id=92481 https://bugzilla.redhat.com/show_bug.cgi?id=1188074 This patch changes the checks to use LLONG_MAX for clarity, and additionally the checks are disabled on 32-bit systems since LLONG_MAX/PPM_SCALE is always larger then the 32-bit long freq value, so multiplication overflows aren't possible there. Reported-by: Josh Boyer <jwboyer@fedoraproject.org> Reported-by: George Joseph <george.joseph@fairview5.com> Tested-by: George Joseph <george.joseph@fairview5.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> # v3.19+ Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sasha Levin <sasha.levin@oracle.com> Link: http://lkml.kernel.org/r/1423553436-29747-1-git-send-email-john.stultz@linaro.org [ Prettified the changelog and the comments a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18sched: Prevent recursion in io_schedule()NeilBrown
io_schedule() calls blk_flush_plug() which, depending on the contents of current->plug, can initiate arbitrary blk-io requests. Note that this contrasts with blk_schedule_flush_plug() which requires all non-trivial work to be handed off to a separate thread. This makes it possible for io_schedule() to recurse, and initiating block requests could possibly call mempool_alloc() which, in times of memory pressure, uses io_schedule(). Apart from any stack usage issues, io_schedule() will not behave correctly when called recursively as delayacct_blkio_start() does not allow for repeated calls. So: - use ->in_iowait to detect recursion. Set it earlier, and restore it to the old value. - move the call to "raw_rq" after the call to blk_flush_plug(). As this is some sort of per-cpu thing, we want some chance that we are on the right CPU - When io_schedule() is called recurively, use blk_schedule_flush_plug() which cannot further recurse. - as this makes io_schedule() a lot more complex and as io_schedule() must match io_schedule_timeout(), but all the changes in io_schedule_timeout() and make io_schedule a simple wrapper for that. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [ Moved the now rudimentary io_schedule() into sched.h. ] Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Tony Battersby <tonyb@cybernetics.com> Link: http://lkml.kernel.org/r/20150213162600.059fffb2@notabene.brown Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18sched/completion: Serialize completion_done() with complete()Oleg Nesterov
Commit de30ec47302c "Remove unnecessary ->wait.lock serialization when reading completion state" was not correct, without lock/unlock the code like stop_machine_from_inactive_cpu() while (!completion_done()) cpu_relax(); can return before complete() finishes its spin_unlock() which writes to this memory. And spin_unlock_wait(). While at it, change try_wait_for_completion() to use READ_ONCE(). Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reported-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [ Added a comment with the barrier. ] Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicholas Mc Guire <der.herr@hofr.at> Cc: raghavendra.kt@linux.vnet.ibm.com Cc: waiman.long@hp.com Fixes: de30ec47302c ("sched/completion: Remove unnecessary ->wait.lock serialization when reading completion state") Link: http://lkml.kernel.org/r/20150212195913.GA30430@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18sched: Fix preempt_schedule_common() triggering tracing recursionFrederic Weisbecker
Since the function graph tracer needs to disable preemption, it might call preempt_schedule() after reenabling it if something triggered the need for rescheduling in between. Therefore we can't trace preempt_schedule() itself because we would face a function tracing recursion otherwise as the tracer is always called before PREEMPT_ACTIVE gets set to prevent that recursion. This is why preempt_schedule() is tagged as "notrace". But the same issue applies to every function called by preempt_schedule() before PREEMPT_ACTIVE is actually set. And preempt_schedule_common() is one such example. Unfortunately we forgot to tag it as notrace as well and as a result we are encountering tracing recursion since it got introduced by: a18b5d0181923 ("sched: Fix missing preemption opportunity") Let's fix that by applying the appropriate function tag to preempt_schedule_common(). Reported-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1424110807-15057-1-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18sched/dl: Prevent enqueue of a sleeping task in dl_task_timer()Kirill Tkhai
A deadline task may be throttled and dequeued at the same time. This happens, when it becomes throttled in schedule(), which is called to go to sleep: current->state = TASK_INTERRUPTIBLE; schedule() deactivate_task() dequeue_task_dl() update_curr_dl() start_dl_timer() __dequeue_task_dl() prev->on_rq = 0; Later the timer fires, but the task is still dequeued: dl_task_timer() enqueue_task_dl() /* queues on dl_rq; on_rq remains 0 */ Someone wakes it up: try_to_wake_up() enqueue_dl_entity() BUG_ON(on_dl_rq()) Patch fixes this problem, it prevents queueing !on_rq tasks on dl_rq. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [ Wrote comment. ] Cc: Juri Lelli <juri.lelli@arm.com> Fixes: 1019a359d3dc ("sched/deadline: Fix stale yield state") Link: http://lkml.kernel.org/r/1374601424090314@web4j.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18sched: Make dl_task_time() use task_rq_lock()Peter Zijlstra
Kirill reported that a dl task can be throttled and dequeued at the same time. This happens, when it becomes throttled in schedule(), which is called to go to sleep: current->state = TASK_INTERRUPTIBLE; schedule() deactivate_task() dequeue_task_dl() update_curr_dl() start_dl_timer() __dequeue_task_dl() prev->on_rq = 0; This invalidates the assumption from commit 0f397f2c90ce ("sched/dl: Fix race in dl_task_timer()"): "The only reason we don't strictly need ->pi_lock now is because we're guaranteed to have p->state == TASK_RUNNING here and are thus free of ttwu races". And therefore we have to use the full task_rq_lock() here. This further amends the fact that we forgot to update the rq lock loop for TASK_ON_RQ_MIGRATE, from commit cca26e8009d1 ("sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state"). Reported-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>