summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2017-03-14Merge branch 'for-4.11-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Three cgroup fixes. Nothing critical: - the pids controller could trigger suspicious RCU warning spuriously. Fixed. - in the debug controller, %p -> %pK to protect kernel pointer from getting exposed. - documentation formatting fix" * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroups: censor kernel pointer in debug files cgroup/pids: remove spurious suspicious RCU usage warning cgroup: Fix indenting in PID controller documentation
2017-03-14Merge branch 'for-4.11-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "If a delayed work is queued with NULL @wq, workqueue code explodes after the timer expires at which point it's difficult to tell who the culprit was. This actually happened and the offender was net/smc this time. Add an explicit sanity check for it in the queueing path" * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq
2017-03-14futex: Add missing error handling to FUTEX_REQUEUE_PIPeter Zijlstra
Thomas spotted that fixup_pi_state_owner() can return errors and we fail to unlock the rt_mutex in that case. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Darren Hart <dvhart@linux.intel.com> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170304093558.867401760@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-14futex: Fix potential use-after-free in FUTEX_REQUEUE_PIPeter Zijlstra
While working on the futex code, I stumbled over this potential use-after-free scenario. Dmitry triggered it later with syzkaller. pi_mutex is a pointer into pi_state, which we drop the reference on in unqueue_me_pi(). So any access to that pointer after that is bad. Since other sites already do rt_mutex_unlock() with hb->lock held, see for example futex_lock_pi(), simply move the unlock before unqueue_me_pi(). Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Darren Hart <dvhart@linux.intel.com> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170304093558.801744246@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-14cpu/hotplug: Serialize callback invocations properSebastian Andrzej Siewior
The setup/remove_state/instance() functions in the hotplug core code are serialized against concurrent CPU hotplug, but unfortunately not serialized against themself. As a consequence a concurrent invocation of these function results in corruption of the callback machinery because two instances try to invoke callbacks on remote cpus at the same time. This results in missing callback invocations and initiator threads waiting forever on the completion. The obvious solution to replace get_cpu_online() with cpu_hotplug_begin() is not possible because at least one callsite calls into these functions from a get_online_cpu() locked region. Extend the protection scope of the cpuhp_state_mutex from solely protecting the state arrays to cover the callback invocation machinery as well. Fixes: 5b7aa87e0482 ("cpu/hotplug: Implement setup/removal interface") Reported-and-tested-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: hpa@zytor.com Cc: mingo@kernel.org Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/20170314150645.g4tdyoszlcbajmna@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-14kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOLNaveen N. Rao
commit fc62d0207ae0 ("kprobes: Introduce weak variant of kprobe_exceptions_notify()") used the __kprobes annotation to exclude kprobe_exceptions_notify from being probed. Since NOKPROBE_SYMBOL() is a better way to do this enabling the symbol to be discovered as being blacklisted, change over to using NOKPROBE_SYMBOL(). Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/3f25bf400da5c222cd9b10eec6ded2d6b58209f8.1488991670.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-03-14drm/i915: annote drop_caches debugfs interface with lockdepDaniel Vetter
The trouble we have is that we can't really test all the shrinker recursion stuff exhaustively in BAT because any kind of thrashing stress test just takes too long. But that leaves a really big gap open, since shrinker recursions are one of the most annoying bugs. Now lockdep already has support for checking allocation deadlocks: - Direct reclaim paths are marked up with lockdep_set_current_reclaim_state() and lockdep_clear_current_reclaim_state(). - Any allocation paths are marked with lockdep_trace_alloc(). If we simply mark up our debugfs with the reclaim annotations, any code and locks taken in there will automatically complete the picture with any allocation paths we already have, as long as we have a simple testcase in BAT which throws out a few objects using this interface. Not stress test or thrashing needed at all. v2: Need to EXPORT_SYMBOL_GPL to make it compile as a module. v3: Fixup rebase fail (spotted by Chris). Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-kernel@vger.kernel.org Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://patchwork.freedesktop.org/patch/msgid/20170312205340.16202-1-daniel.vetter@ffwll.ch Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
2017-03-13rlimits: Print more information when CPU/RT limits are exceededArun Raghavan
When a process is sent a SIGKILL because it exceeded CPU or RT limits, the cause may not be obvious in userspace -- daemonised processes just get killed, and even foreground process just see a 'Killed' message. The lack of any information on why this might be happening in logs can be confusing to users who are not aware of this mechanism. Add messages which dump the process name and tid in dmesg when a process exceeds its CPU or RT limits (soft and hard) in order to make it clearer to people debugging such issues. Signed-off-by: Arun Raghavan <arun@arunraghavan.net> Link: http://lkml.kernel.org/r/20170301145309.27214-1-arun@arunraghavan.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-13perf: Add PERF_RECORD_NAMESPACES to include namespaces related infoHari Bathini
With the advert of container technologies like docker, that depend on namespaces for isolation, there is a need for tracing support for namespaces. This patch introduces new PERF_RECORD_NAMESPACES event for recording namespaces related info. By recording info for every namespace, it is left to userspace to take a call on the definition of a container and trace containers by updating perf tool accordingly. Each namespace has a combination of device and inode numbers. Though every namespace has the same device number currently, that may change in future to avoid the need for a namespace of namespaces. Considering such possibility, record both device and inode numbers separately for each namespace. Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-03-12cpufreq: schedutil: Refactor sugov_next_freq_shared()Viresh Kumar
The loop in sugov_next_freq_shared() contains an if block to skip the loop for the current CPU. This turns out to be an unnecessary conditional in the scheduler's hot-path for every CPU in the policy. It would be better to drop the conditional and make the loop treat all the CPUs in the same way. That would eliminate the need of calling sugov_iowait_boost() at the top of the routine. To keep the code optimized to return early if the current CPU has RT/DL flags set, move the flags check to sugov_update_shared() instead in order to avoid the function call entirely. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-12cpufreq: schedutil: Redefine the rate_limit_us tunableViresh Kumar
The rate_limit_us tunable is intended to reduce the possible overhead from running the schedutil governor. However, that overhead can be divided into two separate parts: the governor computations and the invocation of the scaling driver to set the CPU frequency. The latter is where the real overhead comes from. The former is much less expensive in terms of execution time and running it every time the governor callback is invoked by the scheduler, after rate_limit_us interval has passed since the last frequency update, would not be a problem. For this reason, redefine the rate_limit_us tunable so that it means the minimum time that has to pass between two consecutive invocations of the scaling driver by the schedutil governor (to set the CPU frequency). Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-12Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: - a fix for the kexec/purgatory regression which was introduced in the merge window via an innocent sparse fix. We could have reverted that commit, but on deeper inspection it turned out that the whole machinery is neither documented nor robust. So a proper cleanup was done instead - the fix for the TLB flush issue which was discovered recently - a simple typo fix for a reboot quirk * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tlb: Fix tlb flushing when lguest clears PGE kexec, x86/purgatory: Unbreak it and clean it up x86/reboot/quirks: Fix typo in ASUS EeeBook X205TA reboot quirk
2017-03-11genirq: Add support for nested shared IRQsCharles Keepax
On a specific audio system an interrupt input of an audio CODEC is used as a shared interrupt. That interrupt input is handled by a CODEC specific irq chip driver and triggers a CPU interrupt via the CODEC irq output line. The CODEC interrupt handler demultiplexes the CODEC interrupt inputs and the interrupt handlers for these demultiplexed inputs run nested in the context of the CODEC interrupt handler. The demultiplexed interrupts use handle_nested_irq() as their interrupt handler, which unfortunately has no support for shared interrupts. So the above hardware cannot be supported. Add shared interrupt support to handle_nested_irq() by iterating over the interrupt action chain. [ tglx: Massaged changelog ] Signed-off-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com> Cc: patches@opensource.wolfsonmicro.com Link: http://lkml.kernel.org/r/1488904098-5350-1-git-send-email-ckeepax@opensource.wolfsonmicro.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-10kexec, x86/purgatory: Unbreak it and clean it upThomas Gleixner
The purgatory code defines global variables which are referenced via a symbol lookup in the kexec code (core and arch). A recent commit addressing sparse warnings made these static and thereby broke kexec_file. Why did this happen? Simply because the whole machinery is undocumented and lacks any form of forward declarations. The variable names are unspecific and lack a prefix, so adding forward declarations creates shadow variables in the core code. Aside of that the code relies on magic constants and duplicate struct definitions with no way to ensure that these things stay in sync. The section placement of the purgatory variables happened by chance and not by design. Unbreak kexec and cleanup the mess: - Add proper forward declarations and document the usage - Use common struct definition - Use the proper common defines instead of magic constants - Add a purgatory_ prefix to have a proper name space - Use ARRAY_SIZE() instead of a homebrewn reimplementation - Add proper sections to the purgatory variables [ From Mike ] Fixes: 72042a8c7b01 ("x86/purgatory: Make functions and variables static") Reported-by: Mike Galbraith <<efault@gmx.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Nicholas Mc Guire <der.herr@hofr.at> Cc: Borislav Petkov <bp@alien8.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Tobin C. Harding" <me@tobin.cc> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1703101315140.3681@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-10Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge fixes from Andrew Morton: "26 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits) userfaultfd: remove wrong comment from userfaultfd_ctx_get() fat: fix using uninitialized fields of fat_inode/fsinfo_inode sh: cayman: IDE support fix kasan: fix races in quarantine_remove_cache() kasan: resched in quarantine_remove_cache() mm: do not call mem_cgroup_free() from within mem_cgroup_alloc() thp: fix another corner case of munlock() vs. THPs rmap: fix NULL-pointer dereference on THP munlocking mm/memblock.c: fix memblock_next_valid_pfn() userfaultfd: selftest: vm: allow to build in vm/ directory userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED userfaultfd: non-cooperative: fix fork fctx->new memleak mm/cgroup: avoid panic when init with low memory drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.h mm/vmstats: add thp_split_pud event for clarity include/linux/fs.h: fix unsigned enum warning with gcc-4.2 userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete userfaultfd: non-cooperative: robustness check userfaultfd: non-cooperative: rollback userfaultfd_exit x86, mm: unify exit paths in gup_pte_range() ...
2017-03-09userfaultfd: non-cooperative: rollback userfaultfd_exitAndrea Arcangeli
Patch series "userfaultfd non-cooperative further update for 4.11 merge window". Unfortunately I noticed one relevant bug in userfaultfd_exit while doing more testing. I've been doing testing before and this was also tested by kbuild bot and exercised by the selftest, but this bug never reproduced before. I dropped userfaultfd_exit as result. I dropped it because of implementation difficulty in receiving signals in __mmput and because I think -ENOSPC as result from the background UFFDIO_COPY should be enough already. Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit wasn't exercised by the selftest and when I tried to exercise it, after moving it to a more correct place in __mmput where it would make more sense and where the vma list is stable, it resulted in the event_wait_completion in D state. So then I added the second patch to be sure even if we call userfaultfd_event_wait_completion too late during task exit(), we won't risk to generate tasks in D state. The same check exists in handle_userfault() for the same reason, except it makes a difference there, while here is just a robustness check and it's run under WARN_ON_ONCE. While looking at the userfaultfd_event_wait_completion() function I looked back at its callers too while at it and I think it's not ok to stop executing dup_fctx on the fcs list because we relay on userfaultfd_event_wait_completion to execute userfaultfd_ctx_put(fctx->orig) which is paired against userfaultfd_ctx_get(fctx->orig) in dup_userfault just before list_add(fcs). This change only takes care of fctx->orig but this area also needs further review looking for similar problems in fctx->new. The only patch that is urgent is the first because it's an use after free during a SMP race condition that affects all processes if CONFIG_USERFAULTFD=y. Very hard to reproduce though and probably impossible without SLUB poisoning enabled. This patch (of 3): I once reproduced this oops with the userfaultfd selftest, it's not easily reproducible and it requires SLUB poisoning to reproduce. general protection fault: 0000 [#1] SMP Modules linked in: CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G ------------ T 3.10.0+ #15 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014 task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000 RIP: 0010:[<ffffffff81451299>] [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0 RSP: 0018:ffff8801f833fe80 EFLAGS: 00010202 RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600 RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000 R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700 FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: do_exit+0x297/0xd10 SyS_exit+0x17/0x20 tracesys+0xdd/0xe2 Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 <4c> 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80 RIP [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0 RSP <ffff8801f833fe80> ---[ end trace 9fecd6dcb442846a ]--- In the debugger I located the "mm" pointer in the stack and walking mm->mmap->vm_next through the end shows the vma->vm_next list is fully consistent and it is null terminated list as expected. So this has to be an SMP race condition where userfaultfd_exit was running while the vma list was being modified by another CPU. When userfaultfd_exit() run one of the ->vm_next pointers pointed to SLAB_POISON (RBX is the vma pointer and is 0x6b6b..). The reason is that it's not running in __mmput but while there are still other threads running and it's not holding the mmap_sem (it can't as it has to wait the even to be received by the manager). So this is an use after free that was happening for all processes. One more implementation problem aside from the race condition: userfaultfd_exit has really to check a flag in mm->flags before walking the vma or it's going to slowdown the exit() path for regular tasks. One more implementation problem: at that point signals can't be delivered so it would also create a task in D state if the manager doesn't read the event. The major design issue: it overall looks superfluous as the manager can check for -ENOSPC in the background transfer: if (mmget_not_zero(ctx->mm)) { [..] } else { return -ENOSPC; } It's safer to roll it back and re-introduce it later if at all. [rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT] Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09scripts/spelling.txt: add "overide" pattern and fix typo instancesMasahiro Yamada
Fix typos and add the following to the scripts/spelling.txt: overide||override While we are here, fix the doubled "address" in the touched line Documentation/devicetree/bindings/regulator/ti-abb-regulator.txt. Also, fix the comment block style in the touched hunks in drivers/media/dvb-frontends/drx39xyj/drx_driver.h. Link: http://lkml.kernel.org/r/1481573103-11329-21-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09scripts/spelling.txt: add "disble(d)" pattern and fix typo instancesMasahiro Yamada
Fix typos and add the following to the scripts/spelling.txt: disble||disable disbled||disabled I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c untouched. The macro is not referenced at all, but this commit is touching only comment blocks just in case. Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09Merge tag 'pm-4.11-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix several issues in the intel_pstate driver and one issue in the schedutil cpufreq governor, clean up that governor a bit and hook up existing code for disabling cpufreq to a new kernel command line option. Specifics: - Three fixes for intel_pstate problems related to the passive mode (in which it acts as a regular cpufreq scaling driver), two for the handling of global P-state limits and one for the handling of the cpu_frequency tracepoint in that mode (Rafael Wysocki). - Three fixes for the handling of P-state limits in intel_pstate in the active mode (Rafael Wysocki). - Introduction of a new cpufreq.off=1 kernel command line argument that will disable cpufreq entirely if passed to the kernel and is simply hooked up to the existing code used by Xen (Len Brown). - Fix for the schedutil cpufreq governor to prevent it from using stale raw frequency values in configurations with mutiple CPUs sharing one policy object and a cleanup for it reducing its overhead slightly (Viresh Kumar)" * tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy cpufreq: intel_pstate: Fix intel_pstate_verify_policy() cpufreq: intel_pstate: Fix global settings in active mode cpufreq: Add the "cpufreq.off=1" cmdline option cpufreq: schedutil: Pass sg_policy to get_next_freq() cpufreq: schedutil: move cached_raw_freq to struct sugov_policy cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy() cpufreq: intel_pstate: Do not use performance_limits in passive mode
2017-03-09bpf: convert htab map to hlist_nullsAlexei Starovoitov
when all map elements are pre-allocated one cpu can delete and reuse htab_elem while another cpu is still walking the hlist. In such case the lookup may miss the element. Convert hlist to hlist_nulls to avoid such scenario. When bucket lock is taken there is no need to take such precautions, so only convert map_lookup and map_get_next to nulls. The race window is extremely small and only reproducible with explicit udelay() inside lookup_nulls_elem_raw() Similar to hlist add hlist_nulls_for_each_entry_safe() and hlist_nulls_entry_safe() helpers. Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements") Reported-by: Jonathan Perry <jonperry@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09bpf: fix struct htab_elem layoutAlexei Starovoitov
when htab_elem is removed from the bucket list the htab_elem.hash_node.next field should not be overridden too early otherwise we have a tiny race window between lookup and delete. The bug was discovered by manual code analysis and reproducible only with explicit udelay() in lookup_elem_raw(). Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements") Reported-by: Jonathan Perry <jonperry@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-08kernel: convert css_set.refcount from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-08sched/headers: fix up header file dependency on <linux/sched/signal.h>Linus Torvalds
The scheduler header file split and cleanups ended up exposing a few nasty header file dependencies, and in particular it showed how we in <linux/wait.h> ended up depending on "signal_pending()", which now comes from <linux/sched/signal.h>. That's a very subtle and annoying dependency, which already caused a semantic merge conflict (see commit e58bc927835a "Pull overlayfs updates from Miklos Szeredi", which added that fixup in the merge commit). It turns out that we can avoid this dependency _and_ improve code generation by moving the guts of the fairly nasty helper #define __wait_event_interruptible_locked() to out-of-line code. The code that includes the signal_pending() check is all in the slow-path where we actually go to sleep waiting for the event anyway, so using a helper function is the right thing to do. Using a helper function is also what we already did for the non-locked versions, see the "__wait_event*()" macros and the "prepare_to_wait*()" set of helper functions. We might want to try to unify all these macro games, we have a _lot_ of subtly different wait-event loops. But this is the minimal patch to fix the annoying header dependency. Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-08livepatch: make klp_mutex proper part of APIJiri Kosina
klp_mutex is shared between core.c and transition.c, and as such would rather be properly located in a header so that we don't have to play 'extern' games from .c sources. This also silences sparse warning (wrongly) suggesting that klp_mutex should be defined static. Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: allow removal of a disabled patchJosh Poimboeuf
Currently we do not allow patch module to unload since there is no method to determine if a task is still running in the patched code. The consistency model gives us the way because when the unpatching finishes we know that all tasks were marked as safe to call an original function. Thus every new call to the function calls the original code and at the same time no task can be somewhere in the patched code, because it had to leave that code to be marked as safe. We can safely let the patch module go after that. Completion is used for synchronization between module removal and sysfs infrastructure in a similar way to commit 942e443127e9 ("module: Fix mod->mkobj.kobj potentially freed too early"). Note that we still do not allow the removal for immediate model, that is no consistency model. The module refcount may increase in this case if somebody disables and enables the patch several times. This should not cause any harm. With this change a call to try_module_get() is moved to __klp_enable_patch from klp_register_patch to make module reference counting symmetric (module_put() is in a patch disable path) and to allow to take a new reference to a disabled module when being enabled. Finally, we need to be very careful about possible races between klp_unregister_patch(), kobject_put() functions and operations on the related sysfs files. kobject_put(&patch->kobj) must be called without klp_mutex. Otherwise, it might be blocked by enabled_store() that needs the mutex as well. In addition, enabled_store() must check if the patch was not unregisted in the meantime. There is no need to do the same for other kobject_put() callsites at the moment. Their sysfs operations neither take the lock nor they access any data that might be freed in the meantime. There was an attempt to use kobjects the right way and prevent these races by design. But it made the patch definition more complicated and opened another can of worms. See https://lkml.kernel.org/r/1464018848-4303-1-git-send-email-pmladek@suse.com [Thanks to Petr Mladek for improving the commit message.] Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: change to a per-task consistency modelJosh Poimboeuf
Change livepatch to use a basic per-task consistency model. This is the foundation which will eventually enable us to patch those ~10% of security patches which change function or data semantics. This is the biggest remaining piece needed to make livepatch more generally useful. This code stems from the design proposal made by Vojtech [1] in November 2014. It's a hybrid of kGraft and kpatch: it uses kGraft's per-task consistency and syscall barrier switching combined with kpatch's stack trace switching. There are also a number of fallback options which make it quite flexible. Patches are applied on a per-task basis, when the task is deemed safe to switch over. When a patch is enabled, livepatch enters into a transition state where tasks are converging to the patched state. Usually this transition state can complete in a few seconds. The same sequence occurs when a patch is disabled, except the tasks converge from the patched state to the unpatched state. An interrupt handler inherits the patched state of the task it interrupts. The same is true for forked tasks: the child inherits the patched state of the parent. Livepatch uses several complementary approaches to determine when it's safe to patch tasks: 1. The first and most effective approach is stack checking of sleeping tasks. If no affected functions are on the stack of a given task, the task is patched. In most cases this will patch most or all of the tasks on the first try. Otherwise it'll keep trying periodically. This option is only available if the architecture has reliable stacks (HAVE_RELIABLE_STACKTRACE). 2. The second approach, if needed, is kernel exit switching. A task is switched when it returns to user space from a system call, a user space IRQ, or a signal. It's useful in the following cases: a) Patching I/O-bound user tasks which are sleeping on an affected function. In this case you have to send SIGSTOP and SIGCONT to force it to exit the kernel and be patched. b) Patching CPU-bound user tasks. If the task is highly CPU-bound then it will get patched the next time it gets interrupted by an IRQ. c) In the future it could be useful for applying patches for architectures which don't yet have HAVE_RELIABLE_STACKTRACE. In this case you would have to signal most of the tasks on the system. However this isn't supported yet because there's currently no way to patch kthreads without HAVE_RELIABLE_STACKTRACE. 3. For idle "swapper" tasks, since they don't ever exit the kernel, they instead have a klp_update_patch_state() call in the idle loop which allows them to be patched before the CPU enters the idle state. (Note there's not yet such an approach for kthreads.) All the above approaches may be skipped by setting the 'immediate' flag in the 'klp_patch' struct, which will disable per-task consistency and patch all tasks immediately. This can be useful if the patch doesn't change any function or data semantics. Note that, even with this flag set, it's possible that some tasks may still be running with an old version of the function, until that function returns. There's also an 'immediate' flag in the 'klp_func' struct which allows you to specify that certain functions in the patch can be applied without per-task consistency. This might be useful if you want to patch a common function like schedule(), and the function change doesn't need consistency but the rest of the patch does. For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user must set patch->immediate which causes all tasks to be patched immediately. This option should be used with care, only when the patch doesn't change any function or data semantics. In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE may be allowed to use per-task consistency if we can come up with another way to patch kthreads. The /sys/kernel/livepatch/<patch>/transition file shows whether a patch is in transition. Only a single patch (the topmost patch on the stack) can be in transition at a given time. A patch can remain in transition indefinitely, if any of the tasks are stuck in the initial patch state. A transition can be reversed and effectively canceled by writing the opposite value to the /sys/kernel/livepatch/<patch>/enabled file while the transition is in progress. Then all the tasks will attempt to converge back to the original patch state. [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Ingo Molnar <mingo@kernel.org> # for the scheduler changes Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: store function sizesJosh Poimboeuf
For the consistency model we'll need to know the sizes of the old and new functions to determine if they're on the stacks of any tasks. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: use kstrtobool() in enabled_store()Josh Poimboeuf
The sysfs enabled value is a boolean, so kstrtobool() is a better fit for parsing the input string since it does the range checking for us. Suggested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: move patching functions into patch.cJosh Poimboeuf
Move functions related to the actual patching of functions and objects into a new patch.c file. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: remove unnecessary object loaded checkJosh Poimboeuf
klp_patch_object()'s callers already ensure that the object is loaded, so its call to klp_is_object_loaded() is unnecessary. This will also make it possible to move the patching code into a separate file. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: separate enabled and patched statesJosh Poimboeuf
Once we have a consistency model, patches and their objects will be enabled and disabled at different times. For example, when a patch is disabled, its loaded objects' funcs can remain registered with ftrace indefinitely until the unpatching operation is complete and they're no longer in use. It's less confusing if we give them different names: patches can be enabled or disabled; objects (and their funcs) can be patched or unpatched: - Enabled means that a patch is logically enabled (but not necessarily fully applied). - Patched means that an object's funcs are registered with ftrace and added to the klp_ops func stack. Also, since these states are binary, represent them with booleans instead of ints. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: create temporary klp_update_patch_state() stubJosh Poimboeuf
Create temporary stubs for klp_update_patch_state() so we can add TIF_PATCH_PENDING to different architectures in separate patches without breaking build bisectability. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08stacktrace/x86: add function for detecting reliable stack tracesJosh Poimboeuf
For live patching and possibly other use cases, a stack trace is only useful if it can be assured that it's completely reliable. Add a new save_stack_trace_tsk_reliable() function to achieve that. Note that if the target task isn't the current task, and the target task is allowed to run, then it could be writing the stack while the unwinder is reading it, resulting in possible corruption. So the caller of save_stack_trace_tsk_reliable() must ensure that the task is either 'current' or inactive. save_stack_trace_tsk_reliable() relies on the x86 unwinder's detection of pt_regs on the stack. If the pt_regs are not user-mode registers from a syscall, then they indicate an in-kernel interrupt or exception (e.g. preemption or a page fault), in which case the stack is considered unreliable due to the nature of frame pointers. It also relies on the x86 unwinder's detection of other issues, such as: - corrupted stack data - stack grows the wrong way - stack walk doesn't reach the bottom - user didn't provide a large enough entries array Such issues are reported by checking unwind_error() and !unwind_done(). Also add CONFIG_HAVE_RELIABLE_STACKTRACE so arch-independent code can determine at build time whether the function is implemented. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Ingo Molnar <mingo@kernel.org> # for the x86 changes Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-07Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "This includes a fix for lockups caused by incorrect nsecs related cleanup, and a capabilities check fix for timerfd" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: jiffies: Revert bogus conversion of NSEC_PER_SEC to TICK_NSEC timerfd: Only check CAP_WAKE_ALARM when it is needed
2017-03-07Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "A fix for KVM's scheduler clock which (erroneously) was always marked unstable, a fix for RT/DL load balancing, plus latency fixes" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface sched/core: Fix pick_next_task() for RT,DL sched/fair: Make select_idle_cpu() more aggressive
2017-03-07Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "This includes a fix for a crash if certain special addresses are kprobed, plus does a rename of two Kconfig variables that were a minor misnomer" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Rename CONFIG_[UK]PROBE_EVENT to CONFIG_[UK]PROBE_EVENTS kprobes/x86: Fix kernel panic when certain exception-handling addresses are probed
2017-03-07Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: - Change the new refcount_t warnings from WARN() to WARN_ONCE() - two ww_mutex fixes - plus a new lockdep self-consistency check for a bug that triggered in practice * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/ww_mutex: Adjust the lock number for stress test locking/lockdep: Add nest_lock integrity test locking/ww_mutex: Replace cpu_relax() with cond_resched() for tests locking/refcounts: Change WARN() to WARN_ONCE()
2017-03-07Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace fix from Eric Biederman: "This fixes a race between put_ucounts and get_ucounts that can cause a use after free. The fix works by simplifying the code and so there is not even a temptation to be clever and play spinlock vs atomic reference games" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ucount: Remove the atomicity from ucount->count
2017-03-07Merge tag 'trace-v4.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "There was some breakage with the changes for jump labels in the 4.11 merge window: - powerpc broke as jump labels uses the two LSB bits as flags in initialization. A check was added to make sure that all jump label entries were 4 bytes aligned, but powerpc didn't work that way for modules. Adding an alignment in the module linker script appeared to be the best solution. - Jump labels also added an anonymous union to access those LSB bits as a normal long. But because this structure had static initialization, it broke older compilers that could not statically initialize anonymous unions without brackets. - The command line parameter for setting function graph filter broke the "EMPTY_HASH" descriptor by modifying it instead of creating a new hash to hold the entries. - The command line parameter ftrace_graph_max_depth was added to allow its setting at boot time. It uses existing code and only the command line hook was added. This is not really a fix, but as it uses existing code without affecting anything else, I added it to this release. It was ready before the merge window closed, but I wanted to let it sit in linux-next for a couple of days first" * tag 'trace-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace/graph: Add ftrace_graph_max_depth kernel parameter tracing: Add #undef to fix compile error jump_label: Add comment about initialization order for anonymous unions jump_label: Fix anonymous union initialization module: set __jump_table alignment to 8 ftrace/graph: Do not modify the EMPTY_HASH for the function_graph filter tracing: Fix code comment for ftrace_ops_get_func()
2017-03-07jiffies: Revert bogus conversion of NSEC_PER_SEC to TICK_NSECFrederic Weisbecker
commit 93825f2ec736 converted NSEC_PER_SEC to TICK_NSEC because the author confused NSEC_PER_JIFFY with NSEC_PER_SEC. As a result, the calculation of refined jiffies got broken, triggering lockups. Fixes: 93825f2ec736 ("jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY") Reported-and-tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1488880534-3777-1-git-send-email-fweisbec@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-07Merge tag 'perf-core-for-mingo-4.11-20170306' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo: New features: - Allow sorting by symbol_size in 'perf report' and 'perf top' (Charles Baylis) E.g.: # perf report -s symbol_size,symbol Samples: 9K of event 'cycles:k', Event count (approx.): 2870461623 Overhead Symbol size Symbol 14.55% 326 [k] flush_tlb_mm_range 7.20% 1045 [k] filemap_map_pages 5.82% 124 [k] vma_interval_tree_insert 5.18% 2430 [k] unmap_page_range 2.57% 571 [k] vma_interval_tree_remove 1.94% 494 [k] page_add_file_rmap 1.82% 740 [k] page_remove_rmap 1.66% 1017 [k] release_pages 1.57% 1636 [k] update_blocked_averages 1.57% 76 [k] unlock_page - Add support for -p/--pid, -a/--all-cpus and -C/--cpu in 'perf ftrace' (Namhyung Kim) Change in behaviour: - Make system wide (-a) the default option if no target was specified and one of following conditions is met: - No workload specified (current behaviour) - A workload is specified but all requested events are system wide ones, like uncore ones. (Jiri Olsa) Fixes: - Add missing initialization to the instruction decoder used in the intel PT/BTS code, which was causing lots of failures in 'perf test', looking for a value when there was none (Adrian Hunter) Infrastructure changes: - Add arch code needed to adopt the kernel's refcount_t to aid in catching bugs when using atomic_t as a reference counter, basically cmpxchg related functions (Arnaldo Carvalho de Melo) - Convert the code using atomic_t as reference counts to refcount_t (Elena Rashetova) - Add feature test for sched_getcpu() to more easily check for its presence in the many libc implementations and accross different versions of such C libraries (Arnaldo Carvalho de Melo) - Issue a HW watchdog disable hint in 'perf stat' for when some of the requested events can't get counted because a PMU counter is taken by that watchdog (Borislav Petkov). - Add mapping for Intel's KnightsMill PMU events (Karol Wachowski) Documentation changes: - Clarify the term 'convergence' in: perf bench numa numa-mem -h --show_convergence (Jiri Olsa) Kernel code changes: - Ensure probe location is at function entry in kretprobes (Naveen N. Rao) - Allow return probes with offsets and absolute addresses (Naveen N. Rao) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-06ucount: Remove the atomicity from ucount->countEric W. Biederman
Always increment/decrement ucount->count under the ucounts_lock. The increments are there already and moving the decrements there means the locking logic of the code is simpler. This simplification in the locking logic fixes a race between put_ucounts and get_ucounts that could result in a use-after-free because the count could go zero then be found by get_ucounts and then be freed by put_ucounts. A bug presumably this one was found by a combination of syzkaller and KASAN. JongWhan Kim reported the syzkaller failure and Dmitry Vyukov spotted the race in the code. Cc: stable@vger.kernel.org Fixes: f6b2db1a3e8d ("userns: Make the count of user namespaces per user") Reported-by: JongHwan Kim <zzoru007@gmail.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-03-06workqueue: use setup_deferrable_timerGeliang Tang
Use setup_deferrable_timer() instead of init_timer_deferrable() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06workqueue: trigger WARN if queue_delayed_work() is called with NULL @wqTejun Heo
If queue_delayed_work() gets called with NULL @wq, the kernel will oops asynchronuosly on timer expiration which isn't too helpful in tracking down the offender. This actually happened with smc. __queue_delayed_work() already does several input sanity checks synchronously. Add NULL @wq check. Reported-by: Dave Jones <davej@codemonkey.org.uk> Link: http://lkml.kernel.org/r/20170227171439.jshx3qplflyrgcv7@codemonkey.org.uk Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06cgroups: censor kernel pointer in debug filesKees Cook
As found in grsecurity, this avoids exposing a kernel pointer through the cgroup debug entries. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06cgroup/pids: remove spurious suspicious RCU usage warningTejun Heo
pids_can_fork() is special in that the css association is guaranteed to be stable throughout the function and thus doesn't need RCU protection around task_css access. When determining the css to charge the pid, task_css_check() is used to override the RCU sanity check. While adding a warning message on fork rejection from pids limit, 135b8b37bd91 ("cgroup: Add pids controller event when fork fails because of pid limit") incorrectly added a task_css access which is neither RCU protected or explicitly annotated. This triggers the following suspicious RCU usage warning when RCU debugging is enabled. cgroup: fork rejected by pids controller in =============================== [ ERR: suspicious RCU usage. ] 4.10.0-work+ #1 Not tainted ------------------------------- ./include/linux/cgroup.h:435 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 0 1 lock held by bash/1748: #0: (&cgroup_threadgroup_rwsem){+++++.}, at: [<ffffffff81052c96>] _do_fork+0xe6/0x6e0 stack backtrace: CPU: 3 PID: 1748 Comm: bash Not tainted 4.10.0-work+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014 Call Trace: dump_stack+0x68/0x93 lockdep_rcu_suspicious+0xd7/0x110 pids_can_fork+0x1c7/0x1d0 cgroup_can_fork+0x67/0xc0 copy_process.part.58+0x1709/0x1e90 _do_fork+0xe6/0x6e0 SyS_clone+0x19/0x20 do_syscall_64+0x5c/0x140 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f7853fab93a RSP: 002b:00007ffc12d05c90 EFLAGS: 00000246 ORIG_RAX: 0000000000000038 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7853fab93a RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011 RBP: 00007ffc12d05cc0 R08: 0000000000000000 R09: 00007f78548db700 R10: 00007f78548db9d0 R11: 0000000000000246 R12: 00000000000006d4 R13: 0000000000000001 R14: 0000000000000000 R15: 000055e3ebe2c04d /asdf There's no reason to dereference task_css again here when the associated css is already available. Fix it by replacing the task_cgroup() call with css->cgroup. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Mike Galbraith <efault@gmx.de> Fixes: 135b8b37bd91 ("cgroup: Add pids controller event when fork fails because of pid limit") Cc: Kenny Yu <kennyyu@fb.com> Cc: stable@vger.kernel.org # v4.8+ Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06kernel: convert cgroup_namespace.count from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-05bpf: add get_next_key callback to LPM mapAlexei Starovoitov
map_get_next_key callback is mandatory. Supply dummy handler. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-06prlimit,security,selinux: add a security hook for prlimitStephen Smalley
When SELinux was first added to the kernel, a process could only get and set its own resource limits via getrlimit(2) and setrlimit(2), so no MAC checks were required for those operations, and thus no security hooks were defined for them. Later, SELinux introduced a hook for setlimit(2) with a check if the hard limit was being changed in order to be able to rely on the hard limit value as a safe reset point upon context transitions. Later on, when prlimit(2) was added to the kernel with the ability to get or set resource limits (hard or soft) of another process, LSM/SELinux was not updated other than to pass the target process to the setrlimit hook. This resulted in incomplete control over both getting and setting the resource limits of another process. Add a new security_task_prlimit() hook to the check_prlimit_permission() function to provide complete mediation. The hook is only called when acting on another task, and only if the existing DAC/capability checks would allow access. Pass flags down to the hook to indicate whether the prlimit(2) call will read, write, or both read and write the resource limits of the target process. The existing security_task_setrlimit() hook is left alone; it continues to serve a purpose in supporting the ability to make decisions based on the old and/or new resource limit values when setting limits. This is consistent with the DAC/capability logic, where check_prlimit_permission() performs generic DAC/capability checks for acting on another task, while do_prlimit() performs a capability check based on a comparison of the old and new resource limits. Fix the inline documentation for the hook to match the code. Implement the new hook for SELinux. For setting resource limits, we reuse the existing setrlimit permission. Note that this does overload the setrlimit permission to mean the ability to set the resource limit (soft or hard) of another process or the ability to change one's own hard limit. For getting resource limits, a new getrlimit permission is defined. This was not originally defined since getrlimit(2) could only be used to obtain a process' own limits. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-03-05cpufreq: schedutil: Pass sg_policy to get_next_freq()Viresh Kumar
get_next_freq() uses sg_cpu only to get sg_policy, which the callers of get_next_freq() already have. Pass sg_policy instead of sg_cpu to get_next_freq(), to make it more efficient. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>