summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2015-01-25file->f_path.dentry is pinned down for as long as the file is open...Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-01-25Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "Hopefully the last round of fixes for 3.19 - regression fix for the LDT changes - regression fix for XEN interrupt handling caused by the APIC changes - regression fixes for the PAT changes - last minute fixes for new the MPX support - regression fix for 32bit UP - fix for a long standing relocation issue on 64bit tagged for stable - functional fix for the Hyper-V clocksource tagged for stable - downgrade of a pr_err which tends to confuse users Looks a bit on the large side, but almost half of it are valuable comments" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tsc: Change Fast TSC calibration failed from error to info x86/apic: Re-enable PCI_MSI support for non-SMP X86_32 x86, mm: Change cachemode exports to non-gpl x86, tls: Interpret an all-zero struct user_desc as "no segment" x86, tls, ldt: Stop checking lm in LDT_empty x86, mpx: Strictly enforce empty prctl() args x86, mpx: Fix potential performance issue on unmaps x86, mpx: Explicitly disable 32-bit MPX support on 64-bit kernels x86, hyperv: Mark the Hyper-V clocksource as being continuous x86: Don't rely on VMWare emulating PAT MSR correctly x86, irq: Properly tag virtualization entry in /proc/interrupts x86, boot: Skip relocs when load address unchanged x86/xen: Override ACPI IRQ management callback __acpi_unregister_gsi ACPI: pci: Do not clear pci_dev->irq in acpi_pci_irq_disable() x86/xen: Treat SCI interrupt as normal GSI interrupt
2015-01-25Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "A set of small fixes: - regression fix for exynos_mct clocksource - trivial build fix for kona clocksource - functional one liner fix for the sh_tmu clocksource - two validation fixes to prevent (root only) data corruption in the kernel via settimeofday and adjtimex. Tagged for stable" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: adjtimex: Validate the ADJ_FREQUENCY values time: settimeofday: Validate the values of tv from user clocksource: sh_tmu: Set cpu_possible_mask to fix SMP broadcast clocksource: kona: fix __iomem annotation clocksource: exynos_mct: Fix bitmask regression for exynos4_mct_write
2015-01-24hrtimer: Make __hrtimer_get_next_event() statickbuild test robot
kernel/time/hrtimer.c:444:9: sparse: symbol '__hrtimer_get_next_event' was not declared. Should it be static? Fixes: 9bc7491906b4 hrtimer: Prevent stale expiry time in hrtimer_interrupt() Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Cc: kbuild-all@01.org Link: http://lkml.kernel.org/r/20150123121206.GA4766@snb Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-01-24Merge tag 'fortglx-3.20-time' of ↵Thomas Gleixner
https://git.linaro.org/people/john.stultz/linux into timers/core Pull time updates from John Stultz for 3.20: * ktime division optimization * Expose a few more y2038-safe timekeeping interfaces * RTC core changes to address y2038
2015-01-23rtc: Convert rtc_set_ntp_time() to use timespec64Xunlei Pang
rtc_set_ntp_time() uses timespec which is y2038-unsafe, so modify to use timespec64 which is y2038-safe, then replace rtc_time_to_tm() with rtc_time64_to_tm(). Also adjust all its call sites(only NTP uses it) accordingly. Cc: pang.xunlei <pang.xunlei@linaro.org> Cc: Arnd Bergmann <arnd.bergmann@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org> Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-01-23time: Expose getboottime64 for in-kernel usesJohn Stultz
Adds a timespec64 based getboottime64() implementation that can be used as we convert internal users of getboottime away from using timespecs. Cc: pang.xunlei <pang.xunlei@linaro.org> Cc: Arnd Bergmann <arnd.bergmann@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-01-23ktime: Optimize ktime_divns for constant divisorsNicolas Pitre
At least on ARM, do_div() is optimized to turn constant divisors into an inline multiplication by the reciprocal value at compile time. However this optimization is missed entirely whenever ktime_divns() is used and the slow out-of-line division code is used all the time. Let ktime_divns() use do_div() inline whenever the divisor is constant and small enough. This will make things like ktime_to_us() and ktime_to_ms() much faster. Cc: Arnd Bergmann <arnd.bergmann@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Nicolas Pitre <nico@linaro.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-01-23PM / hibernate: Remove unused functionRickard Strandqvist
Remove the function get_safe_write_buffer() that is not used anywhere. This was partially found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-01-23PM / QoS: Add debugfs support to view the list of constraintsNishanth Menon
PM QoS requests are notoriously hard to debug and made even more so due to their highly dynamic nature. Having visibility into the internal data representation per constraint allows us to have much better appreciation of potential issues or bad usage by drivers in the system. So introduce for all classes of PM QoS, an entry in /sys/kernel/debug/pm_qos that shall show all the current requests as well as the snapshot of the value these requests boil down to. For example: ==> /sys/kernel/debug/pm_qos/cpu_dma_latency <== 1: 4444: Active 2: 2000000000: Default 3: 2000000000: Default 4: 2000000000: Default Type=Minimum, Value=4444, Requests: active=1 / total=4 ==> /sys/kernel/debug/pm_qos/memory_bandwidth <== Empty! ... The actual value listed will have their meaning based on the QoS it is on, the 'Type' indicates what logic it would use to collate the information - Minimum, Maximum, or Sum. Value is the collation of all requests. This interface also compares the values with the defaults for the QoS class and marks the ones that are currently active. Signed-off-by: Nishanth Menon <nm@ti.com> Signed-off-by: Dave Gerlach <d-gerlach@ti.com> Acked-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-01-23X.509: shut up about included cert for silent buildArnd Bergmann
Every kernel build that includes X.509 support prints out a message like - Including cert signing_key.x509 This may be useful for some cases, but when doing automated build tests, it just means noise. To hide the message, this uses '$(kecho)' for printing the message, which means we still see it when building with V=1, but not at the normal level or when building with 'make -s'. Signed-off-by: Arnd Bergmann <arnd@arnd.de> Signed-off-by: David Howells <dhowells@redhat.com>
2015-01-23hrtimer: Prevent stale expiry time in hrtimer_interrupt()Thomas Gleixner
hrtimer_interrupt() has the following subtle issue: hrtimer_interrupt() lock(cpu_base); expires_next = KTIME_MAX; expire_timers(CLOCK_MONOTONIC); expires = get_next_timer(CLOCK_MONOTONIC); if (expires < expires_next) expires_next = expires; expire_timers(CLOCK_REALTIME); unlock(cpu_base); wakeup() hrtimer_start(CLOCK_MONOTONIC, newtimer); lock(cpu_base(); expires = get_next_timer(CLOCK_REALTIME); if (expires < expires_next) expires_next = expires; So because we already evaluated the next expiring timer of CLOCK_MONOTONIC we ignore that the expiry time of newtimer might be earlier than the overall next expiry time in hrtimer_interrupt(). To solve this, remove the caching of the next expiry value from hrtimer_interrupt() and reevaluate all active clock bases for the next expiry value. To avoid another code duplication, create a shared evaluation function and use it for hrtimer_get_next_event(), hrtimer_force_reprogram() and hrtimer_interrupt(). There is another subtlety in this mechanism: While hrtimer_interrupt() is running, we want to avoid to touch the hardware device because we will reprogram it anyway at the end of hrtimer_interrupt(). This works nicely for hrtimers which get rearmed via the HRTIMER_RESTART mechanism, because we drop out when the callback on that CPU is running. But that fails, if a new timer gets enqueued like in the example above. This has another implication: While hrtimer_interrupt() is running we refuse remote enqueueing of timers - see hrtimer_interrupt() and hrtimer_check_target(). hrtimer_interrupt() tries to prevent this by setting cpu_base->expires to KTIME_MAX, but that fails if a new timer gets queued. Prevent both the hardware access and the remote enqueue explicitely. We can loosen the restriction on the remote enqueue now due to reevaluation of the next expiry value, but that needs a seperate patch. Folded in a fix from Vignesh Radhakrishnan. Reported-and-tested-by: Stanislav Fomichev <stfomichev@yandex-team.ru> Based-on-patch-by: Stanislav Fomichev <stfomichev@yandex-team.ru> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: vigneshr@codeaurora.org Cc: john.stultz@linaro.org Cc: viresh.kumar@linaro.org Cc: fweisbec@gmail.com Cc: cl@linux.com Cc: stuart.w.hayes@gmail.com Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1501202049190.5526@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-01-23genirq: Set initial affinity in irq_set_affinity_hint()Jesse Brandeburg
Problem: The default behavior of the kernel is somewhat undesirable as all requested interrupts end up on CPU0 after registration. A user can run irqbalance daemon, or can manually configure smp_affinity via the proc filesystem, but the default affinity of the interrupts for all devices is always CPU zero, this can cause performance problems or very heavy cpu use of only one core if not noticed and fixed by the user. Solution: Enable the setting of the initial affinity directly when the driver sets a hint. This enabling means that kernel drivers can include an initial affinity setting for the interrupt, instead of all interrupts starting out life on CPU0. Of course if irqbalance is still running then the interrupts will get moved as before. This function is currently called by drivers in block, crypto, infiniband, ethernet and scsi trees, but only a handful, so these will be the devices affected by this change. Tested on i40e, and default interrupts were spread across the CPUs according to the hint. drivers/block/mtip32xx/mtip32xx.c:3 drivers/block/nvme-core.c:2 drivers/crypto/qat/qat_dh895xcc/adf_isr.c:3 drivers/infiniband/hw/qib/qib_iba7322.c:2 drivers/net/ethernet/intel/i40e/i40e_main.c:3 drivers/net/ethernet/intel/i40evf/i40evf_main.c:3 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3 drivers/net/ethernet/mellanox/mlx4/en_cq.c:2 drivers/scsi/hpsa.c:3 drivers/scsi/lpfc/lpfc_init.c:3 drivers/scsi/megaraid/megaraid_sas_base.c:8 drivers/soc/ti/knav_qmss_acc.c:1 drivers/soc/ti/knav_qmss_queue.c:2 drivers/virtio/virtio_pci_common.c:2 Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/20141219012206.4220.27491.stgit@jbrandeb-cp2.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-01-23smpboot: Add missing get_online_cpus() in smpboot_register_percpu_thread()Lai Jiangshan
The following race exists in the smpboot percpu threads management: CPU0 CPU1 cpu_up(2) get_online_cpus(); smpboot_create_threads(2); smpboot_register_percpu_thread(); for_each_online_cpu(); __smpboot_create_thread(); __cpu_up(2); This results in a missing per cpu thread for the newly onlined cpu2 and in a NULL pointer dereference on a consecutive offline of that cpu. Proctect smpboot_register_percpu_thread() with get_online_cpus() to prevent that. [ tglx: Massaged changelog and removed the change in smpboot_unregister_percpu_thread() because that's an optimization and therefor not stable material. ] Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1406777421-12830-1-git-send-email-laijs@cn.fujitsu.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-01-23audit: replace getname()/putname() hacks with reference countersPaul Moore
In order to ensure that filenames are not released before the audit subsystem is done with the strings there are a number of hacks built into the fs and audit subsystems around getname() and putname(). To say these hacks are "ugly" would be kind. This patch removes the filename hackery in favor of a more conventional reference count based approach. The diffstat below tells most of the story; lots of audit/fs specific code is replaced with a traditional reference count based approach that is easily understood, even by those not familiar with the audit and/or fs subsystems. CC: viro@zeniv.linux.org.uk CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-01-23audit: fix filename matching in __audit_inode() and __audit_inode_child()Paul Moore
In all likelihood there were some subtle, and perhaps not so subtle, bugs with filename matching in audit_inode() and audit_inode_child() for some time, however, recent changes to the audit filename code have definitely broken the filename matching code. The breakage could result in duplicate filenames in the audit log and other odd audit record entries. This patch fixes the filename matching code and restores some sanity to the filename audit records. CC: viro@zeniv.linux.org.uk CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-01-23audit: enable filename recording via getname_kernel()Paul Moore
Enable recording of filenames in getname_kernel() and remove the kludgy workaround in __audit_inode() now that we have proper filename logging for kernel users. CC: viro@zeniv.linux.org.uk CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-01-22x86, mpx: Strictly enforce empty prctl() argsDave Hansen
Description from Michael Kerrisk. He suggested an identical patch to one I had already coded up and tested. commit fe3d197f8431 "x86, mpx: On-demand kernel allocation of bounds tables" added two new prctl() operations, PR_MPX_ENABLE_MANAGEMENT and PR_MPX_DISABLE_MANAGEMENT. However, no checks were included to ensure that unused arguments are zero, as is done in many existing prctl()s and as should be done for all new prctl()s. This patch adds the required checks. Suggested-by: Andy Lutomirski <luto@amacapital.net> Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20150108223022.7F56FD13@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-01-23Merge tag 'fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module and param fixes from Rusty Russell: "Surprising number of fixes this merge window :( The first two are minor fallout from the param rework which went in this merge window. The next three are a series which fixes a longstanding (but never previously reported and unlikely , so no CC stable) race between kallsyms and freeing the init section. Finally, a minor cleanup as our module refcount will now be -1 during unload" * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: module: make module_refcount() a signed integer. module: fix race in kallsyms resolution during module load success. module: remove mod arg from module_free, rename module_memfree(). module_arch_freeing_init(): new hook for archs before module->module_init freed. param: fix uninitialized read with CONFIG_DEBUG_LOCK_ALLOC param: initialize store function to NULL if not available.
2015-01-22tracing: Use IS_ERR() check for return value of tracing_init_dentry()Steven Rostedt (Red Hat)
tracing_init_dentry() will soon return NULL as a valid pointer for the top level tracing directroy. NULL can not be used as an error value. Instead, switch to ERR_PTR() and check the return status with IS_ERR(). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-01-22tracing: Remove unneeded includes of debugfs.h and fs.hSteven Rostedt (Red Hat)
The creation of tracing files and directories is for the most part encapsulated in helper functions in trace.c. Other files do not need to include debugfs.h or fs.h, as they may have needed to in the past. Remove them from the files that do not need them. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-01-22cgroup: prevent mount hang due to memory controller lifetimeJohannes Weiner
Since b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups"), re-mounting the memory controller after using it is very likely to hang. The cgroup core assumes that any remaining references after deleting a cgroup are temporary in nature, and synchroneously waits for them, but the above-mentioned commit has left-over page cache pin its css until it is reclaimed naturally. That being said, swap entries and charged kernel memory have been doing the same indefinite pinning forever, the bug is just more likely to trigger with left-over page cache. Reparenting kernel memory is highly impractical, which leaves changing the cgroup assumptions to reflect this: once a controller has been mounted and used, it has internal state that is independent from mount and cgroup lifetime. It can be unmounted and remounted, but it can't be reconfigured during subsequent mounts. Don't offline the controller root as long as there are any children, dead or alive. A remount will no longer wait for these old references to drain, it will simply mount the persistent controller state again. Reported-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com> Reported-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-01-22Merge branch 'fortglx/3.19-stable/time' of ↵Thomas Gleixner
https://git.linaro.org/people/john.stultz/linux into timers/urgent Pull urgent fixes from John Stultz: Two urgent fixes for user triggerable time related overflow issues
2015-01-22module: make module_refcount() a signed integer.Rusty Russell
James Bottomley points out that it will be -1 during unload. It's only used for diagnostics, so let's not hide that as it could be a clue as to what's gone wrong. Cc: Jason Wessel <jason.wessel@windriver.com> Acked-and-documention-added-by: James Bottomley <James.Bottomley@HansenPartnership.com> Reviewed-by: Masami Hiramatsu <maasami.hiramatsu.pt@hitachi.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-01-21livepatch: fix uninitialized return valueJosh Poimboeuf
Fix a potentially uninitialized return value in klp_enable_func(). Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-01-21Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Documentation updates. - Miscellaneous fixes. - Preemptible-RCU fixes, including fixing an old bug in the interaction of RCU priority boosting and CPU hotplug. - SRCU updates. - RCU CPU stall-warning updates. - RCU torture-test updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-21Merge branch 'for-3.19-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "The xfs folks have been running into weird and very rare lockups for some time now. I didn't think this could have been from workqueue side because no one else was reporting it. This time, Eric had a kdump which we looked into and it turned out this actually was a workqueue bug and the bug has been there since the beginning of concurrency managed workqueue. A worker pool ensures forward progress of the workqueues associated with it by always having at least one worker reserved from executing work items. When the pool is under contention, the idle one tries to create more workers for the pool and if that doesn't succeed quickly enough, it calls the rescuers to the pool. This logic had a subtle race condition in an early exit path. When a worker invokes this manager function, the function may return %false indicating that the caller may proceed to executing work items either because another worker is already performing the role or conditions have changed and the pool is no longer under contention. The latter part depended on the assumption that whether more workers are necessary or not remains stable while the pool is locked; however, pool->nr_running (concurrency count) may change asynchronously and it getting bumped from zero asynchronously could send off the last idle worker to execute work items. The race window is fairly narrow, and, even when it gets triggered, the pool deadlocks iff if all work items get blocked on pending work items of the pool, which is highly unlikely but can be triggered by xfs. The patch removes the race window by removing the early exit path, which doesn't server any purpose anymore anyway" * 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix subtle pool management issue which can stall whole worker_pool
2015-01-20livepatch: support for repatching a functionJosh Poimboeuf
Add support for patching a function multiple times. If multiple patches affect a function, the function in the most recently enabled patch "wins". This enables a cumulative patch upgrade path, where each patch is a superset of previous patches. This requires restructuring the data a little bit. With the current design, where each klp_func struct has its own ftrace_ops, we'd have to unregister the old ops and then register the new ops, because FTRACE_OPS_FL_IPMODIFY prevents us from having two ops registered for the same function at the same time. That would leave a regression window where the function isn't patched at all (not good for a patch upgrade path). This patch replaces the per-klp_func ftrace_ops with a global klp_ops list, with one ftrace_ops per original function. A single ftrace_ops is shared between all klp_funcs which have the same old_addr. This allows the switch between function versions to happen instantaneously by updating the klp_ops struct's func_stack list. The winner is the klp_func at the top of the func_stack (front of the list). [ jkosina@suse.cz: turn WARN_ON() into WARN_ON_ONCE() in ftrace handler to avoid storm in pathological cases ] Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-01-20livepatch: enforce patch stacking semanticsJosh Poimboeuf
Only allow the topmost patch on the stack to be enabled or disabled, so that patches can't be removed or added in an arbitrary order. Suggested-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-01-20audit: remove vestiges of vers_opsRichard Guy Briggs
Should have been removed with commit 18900909 ("audit: remove the old depricated kernel interface"). Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-01-20livepatch: change ARCH_HAVE_LIVE_PATCHING to HAVE_LIVE_PATCHINGMiroslav Benes
Change ARCH_HAVE_LIVE_PATCHING to HAVE_LIVE_PATCHING in Kconfigs. HAVE_ bools are prevalent there and we should go with the flow. Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-01-20module: fix race in kallsyms resolution during module load success.Rusty Russell
The kallsyms routines (module_symbol_name, lookup_module_* etc) disable preemption to walk the modules rather than taking the module_mutex: this is because they are used for symbol resolution during oopses. This works because there are synchronize_sched() and synchronize_rcu() in the unload and failure paths. However, there's one case which doesn't have that: the normal case where module loading succeeds, and we free the init section. We don't want a synchronize_rcu() there, because it would slow down module loading: this bug was introduced in 2009 to speed module loading in the first place. Thus, we want to do the free in an RCU callback. We do this in the simplest possible way by allocating a new rcu_head: if we put it in the module structure we'd have to worry about that getting freed. Reported-by: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-01-20module: remove mod arg from module_free, rename module_memfree().Rusty Russell
Nothing needs the module pointer any more, and the next patch will call it from RCU, where the module itself might no longer exist. Removing the arg is the safest approach. This just codifies the use of the module_alloc/module_free pattern which ftrace and bpf use. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Mikael Starvik <starvik@axis.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Ley Foon Tan <lftan@altera.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: x86@kernel.org Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: linux-cris-kernel@axis.com Cc: linux-kernel@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: nios2-dev@lists.rocketboards.org Cc: linuxppc-dev@lists.ozlabs.org Cc: sparclinux@vger.kernel.org Cc: netdev@vger.kernel.org
2015-01-20module_arch_freeing_init(): new hook for archs before module->module_init freed.Rusty Russell
Archs have been abusing module_free() to clean up their arch-specific allocations. Since module_free() is also (ab)used by BPF and trace code, let's keep it to simple allocations, and provide a hook called before that. This means that avr32, ia64, parisc and s390 no longer need to implement their own module_free() at all. avr32 doesn't need module_finalize() either. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-kernel@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linux-s390@vger.kernel.org
2015-01-20param: fix uninitialized read with CONFIG_DEBUG_LOCK_ALLOCRusty Russell
ignore_lockdep is uninitialized, and sysfs_attr_init() doesn't initialize it, so memset to 0. Reported-by: Huang Ying <ying.huang@intel.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-01-19futex: Fix argument handling in futex_lock_pi() callsMichael Kerrisk
This patch fixes two separate buglets in calls to futex_lock_pi(): * Eliminate unused 'detect' argument * Change unused 'timeout' argument of FUTEX_TRYLOCK_PI to NULL The 'detect' argument of futex_lock_pi() seems never to have been used (when it was included with the initial PI mutex implementation in Linux 2.6.18, all checks against its value were disabled by ANDing against 0 (i.e., if (detect... && 0)), and with commit 778e9a9c3e7193ea9f434f382947155ffb59c755, any mention of this argument in futex_lock_pi() went way altogether. Its presence now serves only to confuse readers of the code, by giving the impression that the futex() FUTEX_LOCK_PI operation actually does use the 'val' argument. This patch removes the argument. The futex_lock_pi() call that corresponds to FUTEX_TRYLOCK_PI includes 'timeout' as one of its arguments. This misleads the reader into thinking that the FUTEX_TRYLOCK_PI operation does employ timeouts for some sensible purpose; but it does not. Indeed, it cannot, because the checks at the start of sys_futex() exclude FUTEX_TRYLOCK_PI from the set of operations that do copy_from_user() on the timeout argument. So, in the FUTEX_TRYLOCK_PI futex_lock_pi() call it would be simplest to change 'timeout' to 'NULL'. This patch does that. Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Darren Hart <darren@dvhart.com> Link: http://lkml.kernel.org/r/54B96646.8010200@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-01-18netlink: make nlmsg_end() and genlmsg_end() voidJohannes Berg
Contrary to common expectations for an "int" return, these functions return only a positive value -- if used correctly they cannot even return 0 because the message header will necessarily be in the skb. This makes the very common pattern of if (genlmsg_end(...) < 0) { ... } be a whole bunch of dead code. Many places also simply do return nlmsg_end(...); and the caller is expected to deal with it. This also commonly (at least for me) causes errors, because it is very common to write if (my_function(...)) /* error condition */ and if my_function() does "return nlmsg_end()" this is of course wrong. Additionally, there's not a single place in the kernel that actually needs the message length returned, and if anyone needs it later then it'll be very easy to just use skb->len there. Remove this, and make the functions void. This removes a bunch of dead code as described above. The patch adds lines because I did - return nlmsg_end(...); + nlmsg_end(...); + return 0; I could have preserved all the function's return values by returning skb->len, but instead I've audited all the places calling the affected functions and found that none cared. A few places actually compared the return value with <= 0 in dump functionality, but that could just be changed to < 0 with no change in behaviour, so I opted for the more efficient version. One instance of the error I've made numerous times now is also present in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't check for <0 or <=0 and thus broke out of the loop every single time. I've preserved this since it will (I think) have caused the messages to userspace to be formatted differently with just a single message for every SKB returned to userspace. It's possible that this isn't needed for the tools that actually use this, but I don't even know what they are so couldn't test that changing this behaviour would be acceptable. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-17kernel: avoid overflow in cmp_rangeLouis Langholtz
Avoid overflow possibility. [ The overflow is purely theoretical, since this is used for memory ranges that aren't even close to using the full 64 bits, but this is the right thing to do regardless. - Linus ] Signed-off-by: Louis Langholtz <lou_langholtz@me.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Peter Anvin <hpa@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-16workqueue: fix subtle pool management issue which can stall whole worker_poolTejun Heo
A worker_pool's forward progress is guaranteed by the fact that the last idle worker assumes the manager role to create more workers and summon the rescuers if creating workers doesn't succeed in timely manner before proceeding to execute work items. This manager role is implemented in manage_workers(), which indicates whether the worker may proceed to work item execution with its return value. This is necessary because multiple workers may contend for the manager role, and, if there already is a manager, others should proceed to work item execution. Unfortunately, the function also indicates that the worker may proceed to work item execution if need_to_create_worker() is false at the head of the function. need_to_create_worker() tests the following conditions. pending work items && !nr_running && !nr_idle The first and third conditions are protected by pool->lock and thus won't change while holding pool->lock; however, nr_running can change asynchronously as other workers block and resume and while it's likely to be zero, as someone woke this worker up in the first place, some other workers could have become runnable inbetween making it non-zero. If this happens, manage_worker() could return false even with zero nr_idle making the worker, the last idle one, proceed to execute work items. If then all workers of the pool end up blocking on a resource which can only be released by a work item which is pending on that pool, the whole pool can deadlock as there's no one to create more workers or summon the rescuers. This patch fixes the problem by removing the early exit condition from maybe_create_worker() and making manage_workers() return false iff there's already another manager, which ensures that the last worker doesn't start executing work items. We can leave the early exit condition alone and just ignore the return value but the only reason it was put there is because the manage_workers() used to perform both creations and destructions of workers and thus the function may be invoked while the pool is trying to reduce the number of workers. Now that manage_workers() is called only when more workers are needed, the only case this early exit condition is triggered is rare race conditions rendering it pointless. Tested with simulated workload and modified workqueue code which trigger the pool deadlock reliably without this patch. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Eric Sandeen <sandeen@sandeen.net> Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net Cc: Dave Chinner <david@fromorbit.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: stable@vger.kernel.org
2015-01-17Merge tag 'trace-fixes-v3.19-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull ftrace fixes from Steven Rostedt: "This holds a few fixes to the ftrace infrastructure as well as the mixture of function graph tracing and kprobes. When jprobes and function graph tracing is enabled at the same time it will crash the system: # modprobe jprobe_example # echo function_graph > /sys/kernel/debug/tracing/current_tracer After the first fork (jprobe_example probes it), the system will crash. This is due to the way jprobes copies the stack frame and does not do a normal function return. This messes up with the function graph tracing accounting which hijacks the return address from the stack and replaces it with a hook function. It saves the return addresses in a separate stack to put back the correct return address when done. But because the jprobe functions do not do a normal return, their stack addresses are not put back until the function they probe is called, which means that the probed function will get the return address of the jprobe handler instead of its own. The simple fix here was to disable function graph tracing while the jprobe handler is being called. While debugging this I found two minor bugs with the function graph tracing. The first was about the function graph tracer sharing its function hash with the function tracer (they both get filtered by the same input). The changing of the set_ftrace_filter would not sync the function recording records after a change if the function tracer was disabled but the function graph tracer was enabled. This was due to the update only checking one of the ops instead of the shared ops to see if they were enabled and should perform the sync. This caused the ftrace accounting to break and a ftrace_bug() would be triggered, disabling ftrace until a reboot. The second was that the check to update records only checked one of the filter hashes. It needs to test both the "filter" and "notrace" hashes. The "filter" hash determines what functions to trace where as the "notrace" hash determines what functions not to trace (trace all but these). Both hashes need to be passed to the update code to find out what change is being done during the update. This also broke the ftrace record accounting and triggered a ftrace_bug(). This patch set also include two more fixes that were reported separately from the kprobe issue. One was that init_ftrace_syscalls() was called twice at boot up. This is not a major bug, but that call performed a rather large kmalloc (NR_syscalls * sizeof(*syscalls_metadata)). The second call made the first one a memory leak, and wastes memory. The other fix is a regression caused by an update in the v3.19 merge window. The moving to enable events early, moved the enabling before PID 1 was created. The syscall events require setting the TIF_SYSCALL_TRACEPOINT for all tasks. But for_each_process_thread() does not include the swapper task (PID 0), and ended up being a nop. A suggested fix was to add the init_task() to have its flag set, but I didn't really want to mess with PID 0 for this minor bug. Instead I disable and re-enable events again at early_initcall() where it use to be enabled. This also handles any other event that might have its own reg function that could break at early boot up" * tag 'trace-fixes-v3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix enabling of syscall events on the command line tracing: Remove extra call to init_ftrace_syscalls() ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing ftrace: Check both notrace and filter for old hash ftrace: Fix updating of filters for shared global_ops filters
2015-01-15Merge branches 'doc.2015.01.07a', 'fixes.2015.01.15a', ↵Paul E. McKenney
'preempt.2015.01.06a', 'srcu.2015.01.06a', 'stall.2015.01.16a' and 'torture.2015.01.11a' into HEAD doc.2015.01.07a: Documentation updates. fixes.2015.01.15a: Miscellaneous fixes. preempt.2015.01.06a: Changes to handling of lists of preempted tasks. srcu.2015.01.06a: SRCU updates. stall.2015.01.16a: RCU CPU stall-warning updates and fixes. torture.2015.01.11a: RCU torture-test updates and fixes.
2015-01-15rcu: Initialize tiny RCU stall-warning timeouts at bootPaul E. McKenney
The current tiny RCU stall-warning code assumes that the jiffies counter starts at zero, however, it is sometimes initialized to other values, for example, -30,000. This commit therefore changes rcu_init() to invoke reset_cpu_stall_ticks() for both flavors of RCU to initialize the stall-warning times properly at boot. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-15rcu: Fix RCU CPU stall detection in tiny implementationMiroslav Benes
The tiny RCU CPU stall detection depends on *rcp->curtail not being NULL. It is however a tail pointer and thus NULL by definition. Instead we should check rcp->rcucblist for the presence of pending callbacks which need to be processed. With this fix INFO about the stall is printed and jiffies_stall (jiffies at next stall) correctly updated. Note that the check for pending callback is necessary to avoid spurious warnings if there are no pendings callbacks. Signed-off-by: Miroslav Benes <mbenes@suse.cz> [ paulmck: Fused identical "if" statements, ported to -rcu. ] Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-15rcu: Add GP-kthread-starvation checks to CPU stall warningsPaul E. McKenney
This commit adds a message that is printed if the relevant grace-period kthread has not been able to run for the two seconds preceding the stall warning. (The two seconds is double the maximum interval between successive bouts of quiescent-state forcing.) Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-15rcu: Make cond_resched_rcu_qs() apply to normal RCU flavorsPaul E. McKenney
Although cond_resched_rcu_qs() only applies to TASKS_RCU, it is used in places where it would be useful for it to apply to the normal RCU flavors, rcu_preempt, rcu_sched, and rcu_bh. This is especially the case for workloads that aggressively overload the system, particularly those that generate large numbers of RCU updates on systems running NO_HZ_FULL CPUs. This commit therefore communicates quiescent states from cond_resched_rcu_qs() to the normal RCU flavors. Note that it is unfortunately necessary to leave the old ->passed_quiesce mechanism in place to allow quiescent states that apply to only one flavor to be recorded. (Yes, we could decrement ->rcu_qs_ctr_snap in that case, but that is not so good for debugging of RCU internals.) In addition, if one of the RCU flavor's grace period has stalled, this will invoke rcu_momentary_dyntick_idle(), resulting in a heavy-weight quiescent state visible from other CPUs. Reported-by: Sasha Levin <sasha.levin@oracle.com> Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Merge commit from Sasha Levin fixing a bug where __this_cpu() was used in preemptible code. ]
2015-01-15rcu: Optionally run grace-period kthreads at real-time priorityPaul E. McKenney
Recent testing has shown that under heavy load, running RCU's grace-period kthreads at real-time priority can improve performance (according to 0day test robot) and reduce the incidence of RCU CPU stall warnings. However, most systems do just fine with the default non-realtime priorities for these kthreads, and it does not make sense to expose the entire user base to any risk stemming from this change, given that this change is of use only to a few users running extremely heavy workloads. Therefore, this commit allows users to specify realtime priorities for the grace-period kthreads, but leaves them running SCHED_OTHER by default. The realtime priority may be specified at build time via the RCU_KTHREAD_PRIO Kconfig parameter, or at boot time via the rcutree.kthread_prio parameter. Either way, 0 says to continue the default SCHED_OTHER behavior and values from 1-99 specify that priority of SCHED_FIFO behavior. Note that a value of 0 is not permitted when the RCU_BOOST Kconfig parameter is specified. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-15tracing: Fix enabling of syscall events on the command lineSteven Rostedt (Red Hat)
Commit 5f893b2639b2 "tracing: Move enabling tracepoints to just after rcu_init()" broke the enabling of system call events from the command line. The reason was that the enabling of command line trace events was moved before PID 1 started, and the syscall tracepoints require that all tasks have the TIF_SYSCALL_TRACEPOINT flag set. But the swapper task (pid 0) is not part of that. Since the swapper task is the only task that is running at this early in boot, no task gets the flag set, and the tracepoint never gets reached. Instead of setting the swapper task flag (there should be no reason to do that), re-enabled trace events again after the init thread (PID 1) has been started. It requires disabling all command line events and re-enabling them, as just enabling them again will not reset the logic to set the TIF_SYSCALL_TRACEPOINT flag, as the syscall tracepoint will be fooled into thinking that it was already set, and wont try setting it again. For this reason, we must first disable it and re-enable it. Link: http://lkml.kernel.org/r/1421188517-18312-1-git-send-email-mpe@ellerman.id.au Link: http://lkml.kernel.org/r/20150115040506.216066449@goodmis.org Reported-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-01-15tracing: Remove extra call to init_ftrace_syscalls()Steven Rostedt (Red Hat)
trace_init() calls init_ftrace_syscalls() and then calls trace_event_init() which also calls init_ftrace_syscalls(). It makes more sense to only call it from trace_event_init(). Calling it twice wastes memory, as it allocates the syscall events twice, and loses the first copy of it. Link: http://lkml.kernel.org/r/54AF53BD.5070303@huawei.com Link: http://lkml.kernel.org/r/20150115040505.930398632@goodmis.org Reported-by: Wang Nan <wangnan0@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-01-15ftrace: Check both notrace and filter for old hashSteven Rostedt (Red Hat)
Using just the filter for checking for trampolines or regs is not enough when updating the code against the records that represent all functions. Both the filter hash and the notrace hash need to be checked. To trigger this bug (using trace-cmd and perf): # perf probe -a do_fork # trace-cmd start -B foo -e probe # trace-cmd record -p function_graph -n do_fork sleep 1 The trace-cmd record at the end clears the filter before it disables function_graph tracing and then that causes the accounting of the ftrace function records to become incorrect and causes ftrace to bug. Link: http://lkml.kernel.org/r/20150114154329.358378039@goodmis.org Cc: stable@vger.kernel.org [ still need to switch old_hash_ops to old_ops_hash ] Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-01-15ftrace: Fix updating of filters for shared global_ops filtersSteven Rostedt (Red Hat)
As the set_ftrace_filter affects both the function tracer as well as the function graph tracer, the ops that represent each have a shared ftrace_ops_hash structure. This allows both to be updated when the filter files are updated. But if function graph is enabled and the global_ops (function tracing) ops is not, then it is possible that the filter could be changed without the update happening for the function graph ops. This will cause the changes to not take place and may even cause a ftrace_bug to occur as it could mess with the trampoline accounting. The solution is to check if the ops uses the shared global_ops filter and if the ops itself is not enabled, to check if there's another ops that is enabled and also shares the global_ops filter. In that case, the modification still needs to be executed. Link: http://lkml.kernel.org/r/20150114154329.055980438@goodmis.org Cc: stable@vger.kernel.org # 3.17+ Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>