summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2015-01-09perf: Move task_pt_regs sampling into arch codeAndy Lutomirski
On x86_64, at least, task_pt_regs may be only partially initialized in many contexts, so x86_64 should not use it without extra care from interrupt context, let alone NMI context. This will allow x86_64 to override the logic and will supply some scratch space to use to make a cleaner copy of user regs. Tested-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: chenggang.qcg@taobao.com Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jean Pihet <jean.pihet@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/e431cd4c18c2e1c44c774f10758527fb2d1025c4.1420396372.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09livepatch: handle ancient compilers with more graceJiri Kosina
We are aborting a build in case when gcc doesn't support fentry on x86_64 (regs->ip modification can't really reliably work with mcount). This however breaks allmodconfig for people with older gccs that don't support -mfentry. Turn the build-time failure into runtime failure, resulting in the whole infrastructure not being initialized if CC_USING_FENTRY is unset. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
2015-01-08exit: fix race between wait_consider_task() and wait_task_zombie()Oleg Nesterov
wait_consider_task() checks EXIT_ZOMBIE after EXIT_DEAD/EXIT_TRACE and both checks can fail if we race with EXIT_ZOMBIE -> EXIT_DEAD/EXIT_TRACE change in between, gcc needs to reload p->exit_state after security_task_wait(). In this case ->notask_error will be wrongly cleared and do_wait() can hang forever if it was the last eligible child. Many thanks to Arne who carefully investigated the problem. Note: this bug is very old but it was pure theoretical until commit b3ab03160dfa ("wait: completely ignore the EXIT_DEAD tasks"). Before this commit "-O2" was probably enough to guarantee that compiler won't read ->exit_state twice. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Arne Goedeke <el@laramies.com> Tested-by: Arne Goedeke <el@laramies.com> Cc: <stable@vger.kernel.org> [3.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-07time: adjtimex: Validate the ADJ_FREQUENCY valuesSasha Levin
Verify that the frequency value from userspace is valid and makes sense. Unverified values can cause overflows later on. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable <stable@vger.kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> [jstultz: Fix up bug for negative values and drop redunent cap check] Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-01-07time: settimeofday: Validate the values of tv from userSasha Levin
An unvalidated user input is multiplied by a constant, which can result in an undefined behaviour for large values. While this is validated later, we should avoid triggering undefined behaviour. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable <stable@vger.kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> [jstultz: include trivial milisecond->microsecond correction noticed by Andy] Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-01-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2015-01-06livepatch: kconfig: use bool instead of booleanChristoph Jaeger
Keyword 'boolean' for type definition attributes is considered deprecated and should not be used anymore. No functional changes. Reference: http://lkml.kernel.org/r/cover.1418003065.git.cj@linux.com Reference: http://lkml.kernel.org/r/1419108071-11607-1-git-send-email-cj@linux.com Signed-off-by: Christoph Jaeger <cj@linux.com> Reviewed-by: Petr Mladek <pmladek@suse.cz> Reviewed-by: Jingoo Han <jg1.han@samsung.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-01-06rcu: Handle gpnum/completed wrap while dyntick idlePaul E. McKenney
Subtle race conditions can result if a CPU stays in dyntick-idle mode long enough for the ->gpnum and ->completed fields to wrap. For example, consider the following sequence of events: o CPU 1 encounters a quiescent state while waiting for grace period 5 to complete, but then enters dyntick-idle mode. o While CPU 1 is in dyntick-idle mode, the grace-period counters wrap around so that the grace period number is now 4. o Just as CPU 1 exits dyntick-idle mode, grace period 4 completes and grace period 5 begins. o The quiescent state that CPU 1 passed through during the old grace period 5 looks like it applies to the new grace period 5. Therefore, the new grace period 5 completes without CPU 1 having passed through a quiescent state. This could clearly be a fatal surprise to any long-running RCU read-side critical section that happened to be running on CPU 1 at the time. At one time, this was not a problem, given that it takes significant time for the grace-period counters to overflow even on 32-bit systems. However, with the advent of NO_HZ_FULL and SMP embedded systems, arbitrarily long idle periods are now becoming quite feasible. It is therefore time to close this race. This commit therefore avoids this race condition by having the quiescent-state forcing code detect when a CPU is falling too far behind, and setting a new rcu_data field ->gpwrap when this happens. Whenever this new ->gpwrap field is set, the CPU's ->gpnum and ->completed fields are known to be untrustworthy, and can be ignored, along with any associated quiescent states. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Improve diagnostics for spurious RCU CPU stall warningsPaul E. McKenney
The current RCU CPU stall warning code will print "Stall ended before state dump start" any time that the stall-warning code is triggered on a CPU that has already reported a quiescent state for the current grace period and if all quiescent states have been reported for the current grace period. However, a true stall can result in these symptoms, for example, by preventing RCU's grace-period kthreads from ever running This commit therefore checks for this condition, reporting the end of the stall only if one of the grace-period counters has actually advanced. Otherwise, it reports the last time that the grace-period kthread made meaningful progress. (In normal situations, the grace-period kthread should make meaningful progress at least every jiffies_till_next_fqs jiffies.) Reported-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Miroslav Benes <mbenes@suse.cz>
2015-01-06rcu: Make RCU_CPU_STALL_INFO include number of fqs attemptsPaul E. McKenney
One way that an RCU CPU stall warning can happen is if the grace-period kthread is not allowed to execute. One proxy for this kthread's forward progress is the number of force-quiescent-state (fqs) scans. This commit therefore adds the number of fqs scans to the RCU CPU stall warning printouts when CONFIG_RCU_CPU_STALL_INFO=y. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Make SRCU optional by using CONFIG_SRCUPranith Kumar
SRCU is not necessary to be compiled by default in all cases. For tinification efforts not compiling SRCU unless necessary is desirable. The current patch tries to make compiling SRCU optional by introducing a new Kconfig option CONFIG_SRCU which is selected when any of the components making use of SRCU are selected. If we do not select CONFIG_SRCU, srcu.o will not be compiled at all. text data bss dec hex filename 2007 0 0 2007 7d7 kernel/rcu/srcu.o Size of arch/powerpc/boot/zImage changes from text data bss dec hex filename 831552 64180 23944 919676 e087c arch/powerpc/boot/zImage : before 829504 64180 23952 917636 e0084 arch/powerpc/boot/zImage : after so the savings are about ~2000 bytes. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com> CC: Josh Triplett <josh@joshtriplett.org> CC: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
2015-01-06rcu: Expand SRCU ->completed to 64 bitsPaul E. McKenney
When rcutorture used only the low-order 32 bits of the grace-period number, it was not a problem for SRCU to use a 32-bit completed field. However, rcutorture now uses the full 64 bits on 64-bit systems, so this commit converts SRCU's ->completed field to unsigned long so as to provide 64 bits on 64-bit systems. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Remove redundant callback-list initializationPaul E. McKenney
The RCU callback lists are initialized in both rcu_boot_init_percpu_data() and rcu_init_percpu_data(). The former is intended for initializing immutable data, so this commit removes the initialization from rcu_boot_init_percpu_data() and leaves it in rcu_init_percpu_data(). This change prepares for permitting callbacks to be queued very early in boot. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Don't scan root rcu_node structure for stalled tasksPaul E. McKenney
Now that blocked tasks are no longer migrated to the root rcu_node structure, there is no need to scan the root rcu_node structure for blocked tasks stalling the current grace period. This commit therefore removes this scan. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Revert "Allow post-unlock reference for rt_mutex" to avoid ↵Lai Jiangshan
priority-inversion The patch dfeb9765ce3c ("Allow post-unlock reference for rt_mutex") ensured rcu-boost safe even the rt_mutex has post-unlock reference. But rt_mutex allowing post-unlock reference is definitely a bug and it was fixed by the commit 27e35715df54 ("rtmutex: Plug slow unlock race"). This fix made the previous patch (dfeb9765ce3c) useless. And even worse, the priority-inversion introduced by the the previous patch still exists. rcu_read_unlock_special() { rt_mutex_unlock(&rnp->boost_mtx); /* Priority-Inversion: * the current task had been deboosted and preempted as a low * priority task immediately, it could wait long before reschedule in, * and the rcu-booster also waits on this low priority task and sleeps. * This priority-inversion makes rcu-booster can't work * as expected. */ complete(&rnp->boost_completion); } Just revert the patch to avoid it. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Note quiescent state when CPU goes offlinePaul E. McKenney
The rcu_cleanup_dead_cpu() function (called after a CPU has gone completely offline) has not reported a quiescent state because there was probably at least one synchronize_rcu() between the time the CPU went offline and the CPU_DEAD notifier, and this would have detected the CPU's offline state via quiescent-state forcing. However, the plan is for CPUs to take themselves offline, at which point it makes sense for them to report their own quiescent state. This commit makes this change in preparation for the new CPU-hotplug setup. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Don't bother affinitying rcub kthreads away from offline CPUsPaul E. McKenney
When rcu_boost_kthread_setaffinity() sees that all CPUs for a given rcu_node structure are now offline, it affinities the corresponding RCU-boost ("rcub") kthread away from those CPUs. This is pointless because the kthread cannot run on those offline CPUs in any case. This commit therefore removes this unneeded code. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Don't initiate RCU priority boosting on root rcu_nodePaul E. McKenney
Because there is no longer any preempted tasks on the root rcu_node, and because there is no longer ever an rcub kthread for the root rcu_node, this commit drops the code in force_qs_rnp() that attempts to awaken the non-existent root rcub kthread. This is strictly a performance enhancement, removing a root rcu_node ->lock acquisition and release along with some tests in rcu_initiate_boost(), ending with the test that notes that there is no rcub kthread. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Don't spawn rcub kthreads on root rcu_node structurePaul E. McKenney
Now that offlining CPUs no longer moves leaf rcu_node structures' ->blkd_tasks lists to the root, there is no way for the root rcu_node structure's ->blkd_task list to be nonempty, unless the root node is also the sole leaf node. This commit therefore refrains from creating an rcub kthread for the root rcu_node structure unless it is also the sole leaf. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Make use of rcu_preempt_has_tasks()Paul E. McKenney
Given that there is now arcu_preempt_has_tasks() function that checks to see if the ->blkd_tasks list is non-empty, this commit makes use of it. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Shorten irq-disable region in rcu_cleanup_dead_cpu()Paul E. McKenney
Now that we are not migrating callbacks, there is no need to hold the ->orphan_lock across the the ->qsmaskinit bit-clearing process. This commit therefore releases ->orphan_lock immediately after adopting the orphaned RCU callbacks. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Don't migrate blocked tasks even if all corresponding CPUs offlinePaul E. McKenney
When the last CPU associated with a given leaf rcu_node structure goes offline, something must be done about the tasks queued on that rcu_node structure. Each of these tasks has been preempted on one of the leaf rcu_node structure's CPUs while in an RCU read-side critical section that it have not yet exited. Handling these tasks is the job of rcu_preempt_offline_tasks(), which migrates them from the leaf rcu_node structure to the root rcu_node structure. Unfortunately, this migration has to be done one task at a time because each tasks allegiance must be shifted from the original leaf rcu_node to the root, so that future attempts to deal with these tasks will acquire the root rcu_node structure's ->lock rather than that of the leaf. Worse yet, this migration must be done with interrupts disabled, which is not so good for realtime response, especially given that there is no bound on the number of tasks on a given rcu_node structure's list. (OK, OK, there is a bound, it is just that it is unreasonably large, especially on 64-bit systems.) This was not considered a problem back when rcu_preempt_offline_tasks() was first written because realtime systems were assumed not to do CPU-hotplug operations while real-time applications were running. This assumption has proved of dubious validity given that people are starting to run multiple realtime applications on a single SMP system and that it is common practice to offline then online a CPU before starting its real-time application in order to clear extraneous processing off of that CPU. So we now need CPU hotplug operations to avoid undue latencies. This commit therefore avoids migrating these tasks, instead letting them be dequeued one by one from the original leaf rcu_node structure by rcu_read_unlock_special(). This means that the clearing of bits from the upper-level rcu_node structures must be deferred until the last such task has been dequeued, because otherwise subsequent grace periods won't wait on them. This commit has the beneficial side effect of simplifying the CPU-hotplug code for TREE_PREEMPT_RCU, especially in CONFIG_RCU_BOOST builds. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Make rcu_read_unlock_special() propagate ->qsmaskinit bit clearingPaul E. McKenney
This commit causes rcu_read_unlock_special() to propagate ->qsmaskinit bit clearing up the rcu_node tree once a given rcu_node structure's blkd_tasks list becomes empty. This is the final commit in preparation for the rework of RCU priority boosting: It enables preempted tasks to remain queued on their rcu_node structure even after all of that rcu_node structure's CPUs have gone offline. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Abstract rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu()Paul E. McKenney
This commit abstracts rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu() in preparation for the rework of RCU priority boosting. This new function will be invoked from rcu_read_unlock_special() in the reworked scheme, which is why rcu_cleanup_dead_rnp() assumes that the leaf rcu_node structure's ->qsmaskinit field has already been updated. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Rename "empty" to "empty_norm" in preparation for boost reworkPaul E. McKenney
This commit undertakes a simple variable renaming to make way for some rework of RCU priority boosting. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Protect rcu_boost() lockless accesses with ACCESS_ONCE()Paul E. McKenney
This commit prevents random compiler optimizations by applying ACCESS_ONCE() to lockless accesses. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Remove "select IRQ_WORK" from config TREE_RCULai Jiangshan
The 48a7639ce80c ("rcu: Make callers awaken grace-period kthread") removed the irq_work_queue(), so the TREE_RCU doesn't need irq work any more. This commit therefore updates RCU's Kconfig and Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06rcu: Fix rcu_barrier() race that could result in too-short waitPaul E. McKenney
The rcu_barrier() no-callbacks check for no-CBs CPUs has race conditions. It checks a given CPU's lists of callbacks, and if all three no-CBs lists are empty, ignores that CPU. However, these three lists could potentially be empty even when callbacks are present if the check executed just as the callbacks were being moved from one list to another. It turns out that recent versions of rcutorture can spot this race. This commit plugs this hole by consolidating the per-list counts of no-CBs callbacks into a single count, which is incremented before the corresponding callback is posted and after it is invoked. Then rcu_barrier() checks this single count to reliably determine whether the corresponding CPU has no-CBs callbacks. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06hotplugcpu: Avoid deadlocks by waking active_writerDavid Hildenbrand
Commit b2c4623dcd07 ("rcu: More on deadlock between CPU hotplug and expedited grace periods") introduced another problem that can easily be reproduced by starting/stopping cpus in a loop. E.g.: for i in `seq 5000`; do echo 1 > /sys/devices/system/cpu/cpu1/online echo 0 > /sys/devices/system/cpu/cpu1/online done Will result in: INFO: task /cpu_start_stop:1 blocked for more than 120 seconds. Call Trace: ([<00000000006a028e>] __schedule+0x406/0x91c) [<0000000000130f60>] cpu_hotplug_begin+0xd0/0xd4 [<0000000000130ff6>] _cpu_up+0x3e/0x1c4 [<0000000000131232>] cpu_up+0xb6/0xd4 [<00000000004a5720>] device_online+0x80/0xc0 [<00000000004a57f0>] online_store+0x90/0xb0 ... And a deadlock. Problem is that if the last ref in put_online_cpus() can't get the cpu_hotplug.lock the puts_pending count is incremented, but a sleeping active_writer might never be woken up, therefore never exiting the loop in cpu_hotplug_begin(). This fix removes puts_pending and turns refcount into an atomic variable. We also introduce a wait queue for the active_writer, to avoid possible races and use-after-free. There is no need to take the lock in put_online_cpus() anymore. Can't reproduce it with this fix. Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_taskLai Jiangshan
For RCU in UP, context-switch = QS = GP, thus we can force a context-switch when any call_rcu_[bh|sched]() is happened on idle_task. After doing so, rcu_idle/irq_enter/exit() are useless, so we can simply make these functions empty. More important, this change does not change the functionality logically. Note: raise_softirq(RCU_SOFTIRQ)/rcu_sched_qs() in rcu_idle_enter() and outmost rcu_irq_exit() will have to wake up the ksoftirqd (due to in_interrupt() == 0). Before this patch After this patch: call_rcu_sched() in idle; call_rcu_sched() in idle set resched do other stuffs; do other stuffs outmost rcu_irq_exit() outmost rcu_irq_exit() (empty function) (or rcu_idle_enter()) (or rcu_idle_enter(), also empty function) start to resched. (see above) rcu_sched_qs() rcu_sched_qs() QS,and GP and advance cb QS,and GP and advance cb wake up the ksoftirqd wake up the ksoftirqd set resched resched to ksoftirqd (or other) resched to ksoftirqd (or other) These two code patches are almost the same. Size changed after patched: size kernel/rcu/tiny-old.o kernel/rcu/tiny-patched.o text data bss dec hex filename 3449 206 8 3663 e4f kernel/rcu/tiny-old.o 2406 144 8 2558 9fe kernel/rcu/tiny-patched.o Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-03spinlock: Add spin_lock_bh_nested()Thomas Graf
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-31Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/auditLinus Torvalds
Pull audit fix from Paul Moore: "One audit patch to resolve a panic/oops when recording filenames in the audit log, see the mail archive link below. The fix isn't as nice as I would like, as it involves an allocate/copy of the filename, but it solves the problem and the overhead should only affect users who have configured audit rules involving file names. We'll revisit this issue with future kernels in an attempt to make this suck less, but in the meantime I think this fix should go into the next release of v3.19-rcX. [ https://marc.info/?t=141986927600001&r=1&w=2 ]" * 'upstream' of git://git.infradead.org/users/pcmoore/audit: audit: create private file name copies when auditing inodes
2014-12-30rcu: Fix invoke_rcu_callbacks() commentPaul E. McKenney
Despite what the comment says, it is only softirqs that are disabled, not interrupts. This commit therefore fixes the comment. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-12-30rcu: Remove redundant rcu_is_cpu_rrupt_from_idle() from tiny RCUAlexander Gordeev
Let's start assuming that something in the idle loop posts a callback, and scheduling-clock interrupt occurs: 1. The system is idle and stays that way, no runnable tasks. 2. Scheduling-clock interrupt occurs, rcu_check_callbacks() is called as result, which in turn calls rcu_is_cpu_rrupt_from_idle(). 3. rcu_is_cpu_rrupt_from_idle() reports the CPU was interrupted from idle, which results in rcu_sched_qs() call, which does a raise_softirq(RCU_SOFTIRQ). 4. Upon return from interrupt, rcu_irq_exit() is invoked, which calls rcu_idle_enter_common(), which in turn calls rcu_sched_qs() again, which does another raise_softirq(RCU_SOFTIRQ). 5. The softirq happens shortly and invokes rcu_process_callbacks(), which invokes __rcu_process_callbacks(). 6. So now callbacks can be invoked. At least they can be if ->donetail has been updated. Which it will have been because rcu_sched_qs() invokes rcu_qsctr_help(). In the described scenario rcu_sched_qs() and raise_softirq(RCU_SOFTIRQ) get called twice in steps 3 and 4. This redundancy could be eliminated by removing rcu_is_cpu_rrupt_from_idle() function. Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-12-30rcu: Make rcu_nmi_enter() handle nestingPaul E. McKenney
The x86 architecture has multiple types of NMI-like interrupts: real NMIs, machine checks, and, for some values of NMI-like, debugging and breakpoint interrupts. These interrupts can nest inside each other. Andy Lutomirski is adding RCU support to these interrupts, so rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting. This commit therefore introduces nesting, using a clever NMI-coordination algorithm suggested by Andy. The trick is to atomically increment ->dynticks (if needed) before manipulating ->dynticks_nmi_nesting on entry (and, accordingly, after on exit). In addition, ->dynticks_nmi_nesting is incremented by one if ->dynticks was incremented and by two otherwise. This means that when rcu_nmi_exit() sees ->dynticks_nmi_nesting equal to one, it knows that ->dynticks must be atomically incremented. This NMI-coordination algorithms has been validated by the following Promela model: ------------------------------------------------------------------------ /* * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter() * that allows nesting. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, you can access it online at * http://www.gnu.org/licenses/gpl-2.0.html. * * Copyright IBM Corporation, 2014 * * Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> */ byte dynticks_nmi_nesting = 0; byte dynticks = 0; /* * Promela verision of rcu_nmi_enter(). */ inline rcu_nmi_enter() { byte incby; byte tmp; incby = BUSY_INCBY; assert(dynticks_nmi_nesting >= 0); if :: (dynticks & 1) == 0 -> atomic { dynticks = dynticks + 1; } assert((dynticks & 1) == 1); incby = 1; :: else -> skip; fi; tmp = dynticks_nmi_nesting; tmp = tmp + incby; dynticks_nmi_nesting = tmp; assert(dynticks_nmi_nesting >= 1); } /* * Promela verision of rcu_nmi_exit(). */ inline rcu_nmi_exit() { byte tmp; assert(dynticks_nmi_nesting > 0); assert((dynticks & 1) != 0); if :: dynticks_nmi_nesting != 1 -> tmp = dynticks_nmi_nesting; tmp = tmp - BUSY_INCBY; dynticks_nmi_nesting = tmp; :: else -> dynticks_nmi_nesting = 0; atomic { dynticks = dynticks + 1; } assert((dynticks & 1) == 0); fi; } /* * Base-level NMI runs non-atomically. Crudely emulates process-level * dynticks-idle entry/exit. */ proctype base_NMI() { byte busy; busy = 0; do :: /* Emulate base-level dynticks and not. */ if :: 1 -> atomic { dynticks = dynticks + 1; } busy = 1; :: 1 -> skip; fi; /* Verify that we only sometimes have base-level dynticks. */ if :: busy == 0 -> skip; :: busy == 1 -> skip; fi; /* Model RCU's NMI entry and exit actions. */ rcu_nmi_enter(); assert((dynticks & 1) == 1); rcu_nmi_exit(); /* Emulated re-entering base-level dynticks and not. */ if :: !busy -> skip; :: busy -> atomic { dynticks = dynticks + 1; } busy = 0; fi; /* We had better now be in dyntick-idle mode. */ assert((dynticks & 1) == 0); od; } /* * Nested NMI runs atomically to emulate interrupting base_level(). */ proctype nested_NMI() { do :: /* * Use an atomic section to model a nested NMI. This is * guaranteed to interleave into base_NMI() between a pair * of base_NMI() statements, just as a nested NMI would. */ atomic { /* Verify that we only sometimes are in dynticks. */ if :: (dynticks & 1) == 0 -> skip; :: (dynticks & 1) == 1 -> skip; fi; /* Model RCU's NMI entry and exit actions. */ rcu_nmi_enter(); assert((dynticks & 1) == 1); rcu_nmi_exit(); } od; } init { run base_NMI(); run nested_NMI(); } ------------------------------------------------------------------------ The following script can be used to run this model if placed in rcu_nmi.spin: ------------------------------------------------------------------------ if ! spin -a rcu_nmi.spin then echo Spin errors!!! exit 1 fi if ! cc -DSAFETY -o pan pan.c then echo Compilation errors!!! exit 1 fi ./pan -m100000 Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-12-30timecounter: keep track of accumulated fractional nanosecondsRichard Cochran
The current timecounter implementation will drop a variable amount of resolution, depending on the magnitude of the time delta. In other words, reading the clock too often or too close to a time stamp conversion will introduce errors into the time values. This patch fixes the issue by introducing a fractional nanosecond field that accumulates the low order bits. Reported-by: Janusz Użycki <j.uzycki@elproma.com.pl> Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-30time: move the timecounter/cyclecounter code into its own file.Richard Cochran
The timecounter code has almost nothing to do with the clocksource code. Let it live in its own file. This will help isolate the timecounter users from the clocksource users in the source tree. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix double SKB free in bluetooth 6lowpan layer, from Jukka Rissanen. 2) Fix receive checksum handling in enic driver, from Govindarajulu Varadarajan. 3) Fix NAPI poll list corruption in virtio_net and caif_virtio, from Herbert Xu. Also, add code to detect drivers that have this mistake in the future. 4) Fix doorbell endianness handling in mlx4 driver, from Amir Vadai. 5) Don't clobber IP6CB() before xfrm6_policy_check() is called in TCP input path,f rom Nicolas Dichtel. 6) Fix MPLS action validation in openvswitch, from Pravin B Shelar. 7) Fix double SKB free in vxlan driver, also from Pravin. 8) When we scrub a packet, which happens when we are switching the context of the packet (namespace, etc.), we should reset the secmark. From Thomas Graf. 9) ->ndo_gso_check() needs to do more than return true/false, it also has to allow the driver to clear netdev feature bits in order for the caller to be able to proceed properly. From Jesse Gross. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits) genetlink: A genl_bind() to an out-of-range multicast group should not WARN(). netlink/genetlink: pass network namespace to bind/unbind ne2k-pci: Add pci_disable_device in error handling bonding: change error message to debug message in __bond_release_one() genetlink: pass multicast bind/unbind to families netlink: call unbind when releasing socket netlink: update listeners directly when removing socket genetlink: pass only network namespace to genl_has_listeners() netlink: rename netlink_unbind() to netlink_undo_bind() net: Generalize ndo_gso_check to ndo_features_check net: incorrect use of init_completion fixup neigh: remove next ptr from struct neigh_table net: xilinx: Remove unnecessary temac_property in the driver net: phy: micrel: use generic config_init for KSZ8021/KSZ8031 net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding openvswitch: fix odd_ptr_err.cocci warnings Bluetooth: Fix accepting connections when not using mgmt Bluetooth: Fix controller configuration with HCI_QUIRK_INVALID_BDADDR brcmfmac: Do not crash if platform data is not populated ipw2200: select CFG80211_WEXT ...
2014-12-30audit: create private file name copies when auditing inodesPaul Moore
Unfortunately, while commit 4a928436 ("audit: correctly record file names with different path name types") fixed a problem where we were not recording filenames, it created a new problem by attempting to use these file names after they had been freed. This patch resolves the issue by creating a copy of the filename which the audit subsystem frees after it is done with the string. At some point it would be nice to resolve this issue with refcounts, or something similar, instead of having to allocate/copy strings, but that is almost surely beyond the scope of a -rcX patch so we'll defer that for later. On the plus side, only audit users should be impacted by the string copying. Reported-by: Toralf Foerster <toralf.foerster@gmx.de> Signed-off-by: Paul Moore <pmoore@redhat.com>
2014-12-27netlink/genetlink: pass network namespace to bind/unbindJohannes Berg
Netlink families can exist in multiple namespaces, and for the most part multicast subscriptions are per network namespace. Thus it only makes sense to have bind/unbind notifications per network namespace. To achieve this, pass the network namespace of a given client socket to the bind/unbind functions. Also do this in generic netlink, and there also make sure that any bind for multicast groups that only exist in init_net is rejected. This isn't really a problem if it is accepted since a client in a different namespace will never receive any notifications from such a group, but it can confuse the family if not rejected (it's also possible to silently (without telling the family) accept it, but it would also have to be ignored on unbind so families that take any kind of action on bind/unbind won't do unnecessary work for invalid clients like that. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/auditLinus Torvalds
Pull audit fixes from Paul Moore: "Four patches to fix various problems with the audit subsystem, all are fairly small and straightforward. One patch fixes a problem where we weren't using the correct gfp allocation flags (GFP_KERNEL regardless of context, oops), one patch fixes a problem with old userspace tools (this was broken for a while), one patch fixes a problem where we weren't recording pathnames correctly, and one fixes a problem with PID based filters. In general I don't think there is anything controversial with this patchset, and it fixes some rather unfortunate bugs; the allocation flag one can be particularly scary looking for users" * 'upstream' of git://git.infradead.org/users/pcmoore/audit: audit: restore AUDIT_LOGINUID unset ABI audit: correctly record file names with different path name types audit: use supplied gfp_mask from audit_buffer in kauditd_send_multicast_skb audit: don't attempt to lookup PIDs when changing PID filtering audit rules
2014-12-23audit: restore AUDIT_LOGINUID unset ABIRichard Guy Briggs
A regression was caused by commit 780a7654cee8: audit: Make testing for a valid loginuid explicit. (which in turn attempted to fix a regression caused by e1760bd) When audit_krule_to_data() fills in the rules to get a listing, there was a missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID. This broke userspace by not returning the same information that was sent and expected. The rule: auditctl -a exit,never -F auid=-1 gives: auditctl -l LIST_RULES: exit,never f24=0 syscall=all when it should give: LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all Tag it so that it is reported the same way it was set. Create a new private flags audit_krule field (pflags) to store it that won't interact with the public one from the API. Cc: stable@vger.kernel.org # v3.10-rc1+ Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <pmoore@redhat.com>
2014-12-23sched: Fix KMALLOC_MAX_SIZE overflow during cpumask allocationAlex Thorlton
When allocating space for load_balance_mask, in sched_init, when CPUMASK_OFFSTACK is set, we've managed to spill over KMALLOC_MAX_SIZE on our 6144 core machine. The patch below breaks up the allocations so that they don't overflow the max alloc size. It also allocates the masks on the the node from which they'll most commonly be accessed, to minimize remote accesses on NUMA machines. Suggested-by: George Beshers <gbeshers@sgi.com> Signed-off-by: Alex Thorlton <athorlton@sgi.com> Cc: George Beshers <gbeshers@sgi.com> Cc: Russ Anderson <rja@sgi.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418928270-148543-1-git-send-email-athorlton@sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-22tracing: Remove taking of trace_types_lock in pipe filesSteven Rostedt (Red Hat)
Taking the global mutex "trace_types_lock" in the trace_pipe files causes a bottle neck as most the pipe files can be read per cpu and there's no reason to serialize them. The current_trace variable was given a ref count and it can not change when the ref count is not zero. Opening the trace_pipe files will up the ref count (and decremented on close), so that the lock no longer needs to be taken when accessing the current_trace variable. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-12-23param: initialize store function to NULL if not available.Rusty Russell
I rebased Kees' 'param: do not set store func without write perm' on top of my 'params: cleanup sysfs allocation'. However, my patch uses krealloc which doesn't zero memory, leaving .store unset. Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2014-12-22tracing: Add ref count to tracer for when they are being read by pipeSteven Rostedt (Red Hat)
When one of the trace pipe files are being read (by either the trace_pipe or trace_pipe_raw), do not allow the current_trace to change. By adding a ref count that is incremented when the pipe files are opened, will prevent the current_trace from being changed. This will allow for the removal of the global trace_types_lock from reading the pipe buffers (which is currently a bottle neck). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-12-22livepatch: use FTRACE_OPS_FL_IPMODIFYJosh Poimboeuf
Use the FTRACE_OPS_FL_IPMODIFY flag to prevent conflicts with other ftrace users who also modify regs->ip. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Petr Mladek <pmladek@suse.cz> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-12-22audit: correctly record file names with different path name typesPaul Moore
There is a problem with the audit system when multiple audit records are created for the same path, each with a different path name type. The root cause of the problem is in __audit_inode() when an exact match (both the path name and path name type) is not found for a path name record; the existing code creates a new path name record, but it never sets the path name in this record, leaving it NULL. This patch corrects this problem by assigning the path name to these newly created records. There are many ways to reproduce this problem, but one of the easiest is the following (assuming auditd is running): # mkdir /root/tmp/test # touch /root/tmp/test/567 # auditctl -a always,exit -F dir=/root/tmp/test # touch /root/tmp/test/567 Afterwards, or while the commands above are running, check the audit log and pay special attention to the PATH records. A faulty kernel will display something like the following for the file creation: type=SYSCALL msg=audit(1416957442.025:93): arch=c000003e syscall=2 success=yes exit=3 ... comm="touch" exe="/usr/bin/touch" type=CWD msg=audit(1416957442.025:93): cwd="/root/tmp" type=PATH msg=audit(1416957442.025:93): item=0 name="test/" inode=401409 ... nametype=PARENT type=PATH msg=audit(1416957442.025:93): item=1 name=(null) inode=393804 ... nametype=NORMAL type=PATH msg=audit(1416957442.025:93): item=2 name=(null) inode=393804 ... nametype=NORMAL While a patched kernel will show the following: type=SYSCALL msg=audit(1416955786.566:89): arch=c000003e syscall=2 success=yes exit=3 ... comm="touch" exe="/usr/bin/touch" type=CWD msg=audit(1416955786.566:89): cwd="/root/tmp" type=PATH msg=audit(1416955786.566:89): item=0 name="test/" inode=401409 ... nametype=PARENT type=PATH msg=audit(1416955786.566:89): item=1 name="test/567" inode=393804 ... nametype=NORMAL This issue was brought up by a number of people, but special credit should go to hujianyang@huawei.com for reporting the problem along with an explanation of the problem and a patch. While the original patch did have some problems (see the archive link below), it did demonstrate the problem and helped kickstart the fix presented here. * https://lkml.org/lkml/2014/9/5/66 Reported-by: hujianyang <hujianyang@huawei.com> Signed-off-by: Paul Moore <pmoore@redhat.com> Acked-by: Richard Guy Briggs <rgb@redhat.com>
2014-12-22livepatch: move x86 specific ftrace handler code to arch/x86Li Bin
The execution flow redirection related implemention in the livepatch ftrace handler is depended on the specific architecture. This patch introduces klp_arch_set_pc(like kgdb_arch_set_pc) interface to change the pt_regs. Signed-off-by: Li Bin <huawei.libin@huawei.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-12-22livepatch: kernel: add support for live patchingSeth Jennings
This commit introduces code for the live patching core. It implements an ftrace-based mechanism and kernel interface for doing live patching of kernel and kernel module functions. It represents the greatest common functionality set between kpatch and kgraft and can accept patches built using either method. This first version does not implement any consistency mechanism that ensures that old and new code do not run together. In practice, ~90% of CVEs are safe to apply in this way, since they simply add a conditional check. However, any function change that can not execute safely with the old version of the function can _not_ be safely applied in this version. [ jkosina@suse.cz: due to the number of contributions that got folded into this original patch from Seth Jennings, add SUSE's copyright as well, as discussed via e-mail ] Signed-off-by: Seth Jennings <sjenning@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Petr Mladek <pmladek@suse.cz> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Petr Mladek <pmladek@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>