summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2017-03-08livepatch: make klp_mutex proper part of APIJiri Kosina
klp_mutex is shared between core.c and transition.c, and as such would rather be properly located in a header so that we don't have to play 'extern' games from .c sources. This also silences sparse warning (wrongly) suggesting that klp_mutex should be defined static. Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: allow removal of a disabled patchJosh Poimboeuf
Currently we do not allow patch module to unload since there is no method to determine if a task is still running in the patched code. The consistency model gives us the way because when the unpatching finishes we know that all tasks were marked as safe to call an original function. Thus every new call to the function calls the original code and at the same time no task can be somewhere in the patched code, because it had to leave that code to be marked as safe. We can safely let the patch module go after that. Completion is used for synchronization between module removal and sysfs infrastructure in a similar way to commit 942e443127e9 ("module: Fix mod->mkobj.kobj potentially freed too early"). Note that we still do not allow the removal for immediate model, that is no consistency model. The module refcount may increase in this case if somebody disables and enables the patch several times. This should not cause any harm. With this change a call to try_module_get() is moved to __klp_enable_patch from klp_register_patch to make module reference counting symmetric (module_put() is in a patch disable path) and to allow to take a new reference to a disabled module when being enabled. Finally, we need to be very careful about possible races between klp_unregister_patch(), kobject_put() functions and operations on the related sysfs files. kobject_put(&patch->kobj) must be called without klp_mutex. Otherwise, it might be blocked by enabled_store() that needs the mutex as well. In addition, enabled_store() must check if the patch was not unregisted in the meantime. There is no need to do the same for other kobject_put() callsites at the moment. Their sysfs operations neither take the lock nor they access any data that might be freed in the meantime. There was an attempt to use kobjects the right way and prevent these races by design. But it made the patch definition more complicated and opened another can of worms. See https://lkml.kernel.org/r/1464018848-4303-1-git-send-email-pmladek@suse.com [Thanks to Petr Mladek for improving the commit message.] Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: change to a per-task consistency modelJosh Poimboeuf
Change livepatch to use a basic per-task consistency model. This is the foundation which will eventually enable us to patch those ~10% of security patches which change function or data semantics. This is the biggest remaining piece needed to make livepatch more generally useful. This code stems from the design proposal made by Vojtech [1] in November 2014. It's a hybrid of kGraft and kpatch: it uses kGraft's per-task consistency and syscall barrier switching combined with kpatch's stack trace switching. There are also a number of fallback options which make it quite flexible. Patches are applied on a per-task basis, when the task is deemed safe to switch over. When a patch is enabled, livepatch enters into a transition state where tasks are converging to the patched state. Usually this transition state can complete in a few seconds. The same sequence occurs when a patch is disabled, except the tasks converge from the patched state to the unpatched state. An interrupt handler inherits the patched state of the task it interrupts. The same is true for forked tasks: the child inherits the patched state of the parent. Livepatch uses several complementary approaches to determine when it's safe to patch tasks: 1. The first and most effective approach is stack checking of sleeping tasks. If no affected functions are on the stack of a given task, the task is patched. In most cases this will patch most or all of the tasks on the first try. Otherwise it'll keep trying periodically. This option is only available if the architecture has reliable stacks (HAVE_RELIABLE_STACKTRACE). 2. The second approach, if needed, is kernel exit switching. A task is switched when it returns to user space from a system call, a user space IRQ, or a signal. It's useful in the following cases: a) Patching I/O-bound user tasks which are sleeping on an affected function. In this case you have to send SIGSTOP and SIGCONT to force it to exit the kernel and be patched. b) Patching CPU-bound user tasks. If the task is highly CPU-bound then it will get patched the next time it gets interrupted by an IRQ. c) In the future it could be useful for applying patches for architectures which don't yet have HAVE_RELIABLE_STACKTRACE. In this case you would have to signal most of the tasks on the system. However this isn't supported yet because there's currently no way to patch kthreads without HAVE_RELIABLE_STACKTRACE. 3. For idle "swapper" tasks, since they don't ever exit the kernel, they instead have a klp_update_patch_state() call in the idle loop which allows them to be patched before the CPU enters the idle state. (Note there's not yet such an approach for kthreads.) All the above approaches may be skipped by setting the 'immediate' flag in the 'klp_patch' struct, which will disable per-task consistency and patch all tasks immediately. This can be useful if the patch doesn't change any function or data semantics. Note that, even with this flag set, it's possible that some tasks may still be running with an old version of the function, until that function returns. There's also an 'immediate' flag in the 'klp_func' struct which allows you to specify that certain functions in the patch can be applied without per-task consistency. This might be useful if you want to patch a common function like schedule(), and the function change doesn't need consistency but the rest of the patch does. For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user must set patch->immediate which causes all tasks to be patched immediately. This option should be used with care, only when the patch doesn't change any function or data semantics. In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE may be allowed to use per-task consistency if we can come up with another way to patch kthreads. The /sys/kernel/livepatch/<patch>/transition file shows whether a patch is in transition. Only a single patch (the topmost patch on the stack) can be in transition at a given time. A patch can remain in transition indefinitely, if any of the tasks are stuck in the initial patch state. A transition can be reversed and effectively canceled by writing the opposite value to the /sys/kernel/livepatch/<patch>/enabled file while the transition is in progress. Then all the tasks will attempt to converge back to the original patch state. [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Ingo Molnar <mingo@kernel.org> # for the scheduler changes Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: store function sizesJosh Poimboeuf
For the consistency model we'll need to know the sizes of the old and new functions to determine if they're on the stacks of any tasks. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: use kstrtobool() in enabled_store()Josh Poimboeuf
The sysfs enabled value is a boolean, so kstrtobool() is a better fit for parsing the input string since it does the range checking for us. Suggested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: move patching functions into patch.cJosh Poimboeuf
Move functions related to the actual patching of functions and objects into a new patch.c file. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: remove unnecessary object loaded checkJosh Poimboeuf
klp_patch_object()'s callers already ensure that the object is loaded, so its call to klp_is_object_loaded() is unnecessary. This will also make it possible to move the patching code into a separate file. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: separate enabled and patched statesJosh Poimboeuf
Once we have a consistency model, patches and their objects will be enabled and disabled at different times. For example, when a patch is disabled, its loaded objects' funcs can remain registered with ftrace indefinitely until the unpatching operation is complete and they're no longer in use. It's less confusing if we give them different names: patches can be enabled or disabled; objects (and their funcs) can be patched or unpatched: - Enabled means that a patch is logically enabled (but not necessarily fully applied). - Patched means that an object's funcs are registered with ftrace and added to the klp_ops func stack. Also, since these states are binary, represent them with booleans instead of ints. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08livepatch: create temporary klp_update_patch_state() stubJosh Poimboeuf
Create temporary stubs for klp_update_patch_state() so we can add TIF_PATCH_PENDING to different architectures in separate patches without breaking build bisectability. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08stacktrace/x86: add function for detecting reliable stack tracesJosh Poimboeuf
For live patching and possibly other use cases, a stack trace is only useful if it can be assured that it's completely reliable. Add a new save_stack_trace_tsk_reliable() function to achieve that. Note that if the target task isn't the current task, and the target task is allowed to run, then it could be writing the stack while the unwinder is reading it, resulting in possible corruption. So the caller of save_stack_trace_tsk_reliable() must ensure that the task is either 'current' or inactive. save_stack_trace_tsk_reliable() relies on the x86 unwinder's detection of pt_regs on the stack. If the pt_regs are not user-mode registers from a syscall, then they indicate an in-kernel interrupt or exception (e.g. preemption or a page fault), in which case the stack is considered unreliable due to the nature of frame pointers. It also relies on the x86 unwinder's detection of other issues, such as: - corrupted stack data - stack grows the wrong way - stack walk doesn't reach the bottom - user didn't provide a large enough entries array Such issues are reported by checking unwind_error() and !unwind_done(). Also add CONFIG_HAVE_RELIABLE_STACKTRACE so arch-independent code can determine at build time whether the function is implemented. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Ingo Molnar <mingo@kernel.org> # for the x86 changes Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-07Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "This includes a fix for lockups caused by incorrect nsecs related cleanup, and a capabilities check fix for timerfd" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: jiffies: Revert bogus conversion of NSEC_PER_SEC to TICK_NSEC timerfd: Only check CAP_WAKE_ALARM when it is needed
2017-03-07Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "A fix for KVM's scheduler clock which (erroneously) was always marked unstable, a fix for RT/DL load balancing, plus latency fixes" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface sched/core: Fix pick_next_task() for RT,DL sched/fair: Make select_idle_cpu() more aggressive
2017-03-07Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "This includes a fix for a crash if certain special addresses are kprobed, plus does a rename of two Kconfig variables that were a minor misnomer" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Rename CONFIG_[UK]PROBE_EVENT to CONFIG_[UK]PROBE_EVENTS kprobes/x86: Fix kernel panic when certain exception-handling addresses are probed
2017-03-07Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: - Change the new refcount_t warnings from WARN() to WARN_ONCE() - two ww_mutex fixes - plus a new lockdep self-consistency check for a bug that triggered in practice * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/ww_mutex: Adjust the lock number for stress test locking/lockdep: Add nest_lock integrity test locking/ww_mutex: Replace cpu_relax() with cond_resched() for tests locking/refcounts: Change WARN() to WARN_ONCE()
2017-03-07Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace fix from Eric Biederman: "This fixes a race between put_ucounts and get_ucounts that can cause a use after free. The fix works by simplifying the code and so there is not even a temptation to be clever and play spinlock vs atomic reference games" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ucount: Remove the atomicity from ucount->count
2017-03-07Merge tag 'trace-v4.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "There was some breakage with the changes for jump labels in the 4.11 merge window: - powerpc broke as jump labels uses the two LSB bits as flags in initialization. A check was added to make sure that all jump label entries were 4 bytes aligned, but powerpc didn't work that way for modules. Adding an alignment in the module linker script appeared to be the best solution. - Jump labels also added an anonymous union to access those LSB bits as a normal long. But because this structure had static initialization, it broke older compilers that could not statically initialize anonymous unions without brackets. - The command line parameter for setting function graph filter broke the "EMPTY_HASH" descriptor by modifying it instead of creating a new hash to hold the entries. - The command line parameter ftrace_graph_max_depth was added to allow its setting at boot time. It uses existing code and only the command line hook was added. This is not really a fix, but as it uses existing code without affecting anything else, I added it to this release. It was ready before the merge window closed, but I wanted to let it sit in linux-next for a couple of days first" * tag 'trace-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace/graph: Add ftrace_graph_max_depth kernel parameter tracing: Add #undef to fix compile error jump_label: Add comment about initialization order for anonymous unions jump_label: Fix anonymous union initialization module: set __jump_table alignment to 8 ftrace/graph: Do not modify the EMPTY_HASH for the function_graph filter tracing: Fix code comment for ftrace_ops_get_func()
2017-03-07jiffies: Revert bogus conversion of NSEC_PER_SEC to TICK_NSECFrederic Weisbecker
commit 93825f2ec736 converted NSEC_PER_SEC to TICK_NSEC because the author confused NSEC_PER_JIFFY with NSEC_PER_SEC. As a result, the calculation of refined jiffies got broken, triggering lockups. Fixes: 93825f2ec736 ("jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY") Reported-and-tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1488880534-3777-1-git-send-email-fweisbec@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-07Merge tag 'perf-core-for-mingo-4.11-20170306' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo: New features: - Allow sorting by symbol_size in 'perf report' and 'perf top' (Charles Baylis) E.g.: # perf report -s symbol_size,symbol Samples: 9K of event 'cycles:k', Event count (approx.): 2870461623 Overhead Symbol size Symbol 14.55% 326 [k] flush_tlb_mm_range 7.20% 1045 [k] filemap_map_pages 5.82% 124 [k] vma_interval_tree_insert 5.18% 2430 [k] unmap_page_range 2.57% 571 [k] vma_interval_tree_remove 1.94% 494 [k] page_add_file_rmap 1.82% 740 [k] page_remove_rmap 1.66% 1017 [k] release_pages 1.57% 1636 [k] update_blocked_averages 1.57% 76 [k] unlock_page - Add support for -p/--pid, -a/--all-cpus and -C/--cpu in 'perf ftrace' (Namhyung Kim) Change in behaviour: - Make system wide (-a) the default option if no target was specified and one of following conditions is met: - No workload specified (current behaviour) - A workload is specified but all requested events are system wide ones, like uncore ones. (Jiri Olsa) Fixes: - Add missing initialization to the instruction decoder used in the intel PT/BTS code, which was causing lots of failures in 'perf test', looking for a value when there was none (Adrian Hunter) Infrastructure changes: - Add arch code needed to adopt the kernel's refcount_t to aid in catching bugs when using atomic_t as a reference counter, basically cmpxchg related functions (Arnaldo Carvalho de Melo) - Convert the code using atomic_t as reference counts to refcount_t (Elena Rashetova) - Add feature test for sched_getcpu() to more easily check for its presence in the many libc implementations and accross different versions of such C libraries (Arnaldo Carvalho de Melo) - Issue a HW watchdog disable hint in 'perf stat' for when some of the requested events can't get counted because a PMU counter is taken by that watchdog (Borislav Petkov). - Add mapping for Intel's KnightsMill PMU events (Karol Wachowski) Documentation changes: - Clarify the term 'convergence' in: perf bench numa numa-mem -h --show_convergence (Jiri Olsa) Kernel code changes: - Ensure probe location is at function entry in kretprobes (Naveen N. Rao) - Allow return probes with offsets and absolute addresses (Naveen N. Rao) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-06ucount: Remove the atomicity from ucount->countEric W. Biederman
Always increment/decrement ucount->count under the ucounts_lock. The increments are there already and moving the decrements there means the locking logic of the code is simpler. This simplification in the locking logic fixes a race between put_ucounts and get_ucounts that could result in a use-after-free because the count could go zero then be found by get_ucounts and then be freed by put_ucounts. A bug presumably this one was found by a combination of syzkaller and KASAN. JongWhan Kim reported the syzkaller failure and Dmitry Vyukov spotted the race in the code. Cc: stable@vger.kernel.org Fixes: f6b2db1a3e8d ("userns: Make the count of user namespaces per user") Reported-by: JongHwan Kim <zzoru007@gmail.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-03-06workqueue: use setup_deferrable_timerGeliang Tang
Use setup_deferrable_timer() instead of init_timer_deferrable() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06workqueue: trigger WARN if queue_delayed_work() is called with NULL @wqTejun Heo
If queue_delayed_work() gets called with NULL @wq, the kernel will oops asynchronuosly on timer expiration which isn't too helpful in tracking down the offender. This actually happened with smc. __queue_delayed_work() already does several input sanity checks synchronously. Add NULL @wq check. Reported-by: Dave Jones <davej@codemonkey.org.uk> Link: http://lkml.kernel.org/r/20170227171439.jshx3qplflyrgcv7@codemonkey.org.uk Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06cgroups: censor kernel pointer in debug filesKees Cook
As found in grsecurity, this avoids exposing a kernel pointer through the cgroup debug entries. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06cgroup/pids: remove spurious suspicious RCU usage warningTejun Heo
pids_can_fork() is special in that the css association is guaranteed to be stable throughout the function and thus doesn't need RCU protection around task_css access. When determining the css to charge the pid, task_css_check() is used to override the RCU sanity check. While adding a warning message on fork rejection from pids limit, 135b8b37bd91 ("cgroup: Add pids controller event when fork fails because of pid limit") incorrectly added a task_css access which is neither RCU protected or explicitly annotated. This triggers the following suspicious RCU usage warning when RCU debugging is enabled. cgroup: fork rejected by pids controller in =============================== [ ERR: suspicious RCU usage. ] 4.10.0-work+ #1 Not tainted ------------------------------- ./include/linux/cgroup.h:435 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 0 1 lock held by bash/1748: #0: (&cgroup_threadgroup_rwsem){+++++.}, at: [<ffffffff81052c96>] _do_fork+0xe6/0x6e0 stack backtrace: CPU: 3 PID: 1748 Comm: bash Not tainted 4.10.0-work+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014 Call Trace: dump_stack+0x68/0x93 lockdep_rcu_suspicious+0xd7/0x110 pids_can_fork+0x1c7/0x1d0 cgroup_can_fork+0x67/0xc0 copy_process.part.58+0x1709/0x1e90 _do_fork+0xe6/0x6e0 SyS_clone+0x19/0x20 do_syscall_64+0x5c/0x140 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f7853fab93a RSP: 002b:00007ffc12d05c90 EFLAGS: 00000246 ORIG_RAX: 0000000000000038 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7853fab93a RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011 RBP: 00007ffc12d05cc0 R08: 0000000000000000 R09: 00007f78548db700 R10: 00007f78548db9d0 R11: 0000000000000246 R12: 00000000000006d4 R13: 0000000000000001 R14: 0000000000000000 R15: 000055e3ebe2c04d /asdf There's no reason to dereference task_css again here when the associated css is already available. Fix it by replacing the task_cgroup() call with css->cgroup. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Mike Galbraith <efault@gmx.de> Fixes: 135b8b37bd91 ("cgroup: Add pids controller event when fork fails because of pid limit") Cc: Kenny Yu <kennyyu@fb.com> Cc: stable@vger.kernel.org # v4.8+ Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06kernel: convert cgroup_namespace.count from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-05bpf: add get_next_key callback to LPM mapAlexei Starovoitov
map_get_next_key callback is mandatory. Supply dummy handler. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-06prlimit,security,selinux: add a security hook for prlimitStephen Smalley
When SELinux was first added to the kernel, a process could only get and set its own resource limits via getrlimit(2) and setrlimit(2), so no MAC checks were required for those operations, and thus no security hooks were defined for them. Later, SELinux introduced a hook for setlimit(2) with a check if the hard limit was being changed in order to be able to rely on the hard limit value as a safe reset point upon context transitions. Later on, when prlimit(2) was added to the kernel with the ability to get or set resource limits (hard or soft) of another process, LSM/SELinux was not updated other than to pass the target process to the setrlimit hook. This resulted in incomplete control over both getting and setting the resource limits of another process. Add a new security_task_prlimit() hook to the check_prlimit_permission() function to provide complete mediation. The hook is only called when acting on another task, and only if the existing DAC/capability checks would allow access. Pass flags down to the hook to indicate whether the prlimit(2) call will read, write, or both read and write the resource limits of the target process. The existing security_task_setrlimit() hook is left alone; it continues to serve a purpose in supporting the ability to make decisions based on the old and/or new resource limit values when setting limits. This is consistent with the DAC/capability logic, where check_prlimit_permission() performs generic DAC/capability checks for acting on another task, while do_prlimit() performs a capability check based on a comparison of the old and new resource limits. Fix the inline documentation for the hook to match the code. Implement the new hook for SELinux. For setting resource limits, we reuse the existing setrlimit permission. Note that this does overload the setrlimit permission to mean the ability to set the resource limit (soft or hard) of another process or the ability to change one's own hard limit. For getting resource limits, a new getrlimit permission is defined. This was not originally defined since getrlimit(2) could only be used to obtain a process' own limits. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-03-05cpufreq: schedutil: Pass sg_policy to get_next_freq()Viresh Kumar
get_next_freq() uses sg_cpu only to get sg_policy, which the callers of get_next_freq() already have. Pass sg_policy instead of sg_cpu to get_next_freq(), to make it more efficient. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-05cpufreq: schedutil: move cached_raw_freq to struct sugov_policyViresh Kumar
cached_raw_freq applies to the entire cpufreq policy and not individual CPUs. Apart from wasting per-cpu memory, it is actually wrong to keep it in struct sugov_cpu as we may end up comparing next_freq with a stale cached_raw_freq of a random CPU. Move cached_raw_freq to struct sugov_policy. Fixes: 5cbea46984d6 (cpufreq: schedutil: map raw required frequency to driver frequency) Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix double-free in batman-adv, from Sven Eckelmann. 2) Fix packet stats for fast-RX path, from Joannes Berg. 3) Netfilter's ip_route_me_harder() doesn't handle request sockets properly, fix from Florian Westphal. 4) Fix sendmsg deadlock in rxrpc, from David Howells. 5) Add missing RCU locking to transport hashtable scan, from Xin Long. 6) Fix potential packet loss in mlxsw driver, from Ido Schimmel. 7) Fix race in NAPI handling between poll handlers and busy polling, from Eric Dumazet. 8) TX path in vxlan and geneve need proper RCU locking, from Jakub Kicinski. 9) SYN processing in DCCP and TCP need to disable BH, from Eric Dumazet. 10) Properly handle net_enable_timestamp() being invoked from IRQ context, also from Eric Dumazet. 11) Fix crash on device-tree systems in xgene driver, from Alban Bedel. 12) Do not call sk_free() on a locked socket, from Arnaldo Carvalho de Melo. 13) Fix use-after-free in netvsc driver, from Dexuan Cui. 14) Fix max MTU setting in bonding driver, from WANG Cong. 15) xen-netback hash table can be allocated from softirq context, so use GFP_ATOMIC. From Anoob Soman. 16) Fix MAC address change bug in bgmac driver, from Hari Vyas. 17) strparser needs to destroy strp_wq on module exit, from WANG Cong. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits) strparser: destroy workqueue on module exit sfc: fix IPID endianness in TSOv2 sfc: avoid max() in array size rds: remove unnecessary returned value check rxrpc: Fix potential NULL-pointer exception nfp: correct DMA direction in XDP DMA sync nfp: don't tell FW about the reserved buffer space net: ethernet: bgmac: mac address change bug net: ethernet: bgmac: init sequence bug xen-netback: don't vfree() queues under spinlock xen-netback: keep a local pointer for vif in backend_disconnect() netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups netfilter: nf_conntrack_sip: fix wrong memory initialisation can: flexcan: fix typo in comment can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer can: gs_usb: fix coding style can: gs_usb: Don't use stack memory for USB transfers ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines ixgbe: update the rss key on h/w, when ethtool ask for it ...
2017-03-03trace/kprobes: Add back warning about offset in return probesSteven Rostedt (VMware)
Let's not remove the warning about offsets and return probes when the offset is invalid. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/20170227115204.00f92846@gandalf.local.home Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-03-03trace/kprobes: Allow return probes with offsets and absolute addressesNaveen N. Rao
Since the kernel includes many non-global functions with same names, we will need to use offsets from other symbols (typically _text/_stext) or absolute addresses to place return probes on specific functions. Also, the core register_kretprobe() API never forbid use of offsets or absolute addresses with kretprobes. Allow its use with the trace infrastructure. To distinguish kernels that support this, update ftrace README to explicitly call this out. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/183e7ce2921a08c9c755ee9a5da3134febc6695b.1487770934.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-03-03kretprobes: Ensure probe location is at function entryNaveen N. Rao
kretprobes can be registered by specifying an absolute address or by specifying offset to a symbol. However, we need to ensure this falls at function entry so as to be able to determine the return address. Validate the same during kretprobe registration. By default, there should not be any offset from a function entry, as determined through a kallsyms_lookup(). Introduce arch_function_offset_within_entry() as a way for architectures to override this. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/f1583bc4839a3862cfc2acefcc56f9c8837fa2ba.1487770934.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-03-03Merge branch 'WIP.sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull sched.h split-up from Ingo Molnar: "The point of these changes is to significantly reduce the <linux/sched.h> header footprint, to speed up the kernel build and to have a cleaner header structure. After these changes the new <linux/sched.h>'s typical preprocessed size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K lines), which is around 40% faster to build on typical configs. Not much changed from the last version (-v2) posted three weeks ago: I eliminated quirks, backmerged fixes plus I rebased it to an upstream SHA1 from yesterday that includes most changes queued up in -next plus all sched.h changes that were pending from Andrew. I've re-tested the series both on x86 and on cross-arch defconfigs, and did a bisectability test at a number of random points. I tried to test as many build configurations as possible, but some build breakage is probably still left - but it should be mostly limited to architectures that have no cross-compiler binaries available on kernel.org, and non-default configurations" * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits) sched/headers: Clean up <linux/sched.h> sched/headers: Remove #ifdefs from <linux/sched.h> sched/headers: Remove the <linux/topology.h> include from <linux/sched.h> sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h> sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h> sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h> sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h> sched/headers: Remove <linux/sched.h> from <linux/sched/init.h> sched/core: Remove unused prefetch_stack() sched/headers: Remove <linux/rculist.h> from <linux/sched.h> sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h> sched/headers: Remove <linux/signal.h> from <linux/sched.h> sched/headers: Remove <linux/rwsem.h> from <linux/sched.h> sched/headers: Remove the runqueue_is_locked() prototype sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h> sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h> sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h> sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h> ...
2017-03-03ftrace/graph: Add ftrace_graph_max_depth kernel parameterTodd Brandt
Early trace callgraphs can be extremely large on systems with several seconds of boot time. The max_depth parameter limits how deep the graph trace goes and reduces the output size. This parameter is the same as the max_graph_depth file in tracefs. Link: http://lkml.kernel.org/r/1488499935-23216-1-git-send-email-todd.e.brandt@linux.intel.com Signed-off-by: Todd Brandt <todd.e.brandt@linux.intel.com> [ changed comments about debugfs to tracefs ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-03ftrace/graph: Do not modify the EMPTY_HASH for the function_graph filterSteven Rostedt (VMware)
On boot up, if the kernel command line sets a graph funtion with the kernel command line options "ftrace_graph_filter" or "ftrace_graph_notrace" then it updates the corresponding function graph hash, ftrace_graph_hash or ftrace_graph_notrace_hash respectively. Unfortunately, at boot up, these variables are pointers to the "EMPTY_HASH" which is a constant used as a placeholder when a hash has no entities. The problem was that the comand line version to set the hashes updated the actual EMPTY_HASH instead of creating a new hash for the function graph. This broke the EMPTY_HASH because not only did it modify a constant (not sure how that was allowed to happen, except maybe because it was done at early boot, const variables were still mutable), but it made the filters have functions listed in them when they were actually empty. The kernel command line function needs to allocate a new hash for the function graph filters and assign the necessary variables to that new hash instead. Link: http://lkml.kernel.org/r/1488420091.7212.17.camel@linux.intel.com Cc: Namhyung Kim <namhyung@kernel.org> Fixes: b9b0c831bed2 ("ftrace: Convert graph filter to use hash tables") Reported-by: Todd Brandt <todd.e.brandt@linux.intel.com> Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-02Merge tag 'pm-extra-4.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates deom Rafael Wysocki: "These fix two bugs introduced by recent power management updates (in the cpuidle menu governor and intel_pstate) and a few other issues, clean up things and remove unused code. Specifics: - Fix for a cpuidle menu governor problem that started to take an unnecessary spinlock after one of the recent updates and that did not play well with the RT patch (Rafael Wysocki). - Fix for the new intel_pstate operation mode switching feature added recently that did not reinitialize P-state limits properly when switching operation modes (Rafael Wysocki). - Removal of unused global notifiers from the PM QoS framework (Viresh Kumar). - Generic power domains framework update to make it handle asynchronous invocations of PM callbacks in the "noirq" phases of system suspend/hibernation correctly (Ulf Hansson). - Two hibernation core cleanups (Rafael Wysocki). - intel_idle cleanup related to the sysfs interface (Len Brown). - Off-by-one bug fix in the OPP (Operating Performance Points) framework (Andrzej Hajda). - OPP framework's documentation fix (Viresh Kumar). - cpufreq qoriq driver cleanup (Tang Yuantian). - Fixes for typos in comments in the device runtime PM framework (Christophe Jaillet)" * tag 'pm-extra-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM / OPP: Documentation: Fix opp-microvolt in examples intel_idle: stop exposing platform acronyms in sysfs cpufreq: intel_pstate: Fix limits issue with operation mode switching PM / hibernate: Define pr_fmt() and use pr_*() instead of printk() PM / hibernate: Untangle power_down() cpuidle: menu: Avoid taking spinlock for accessing QoS values PM / QoS: Remove global notifiers PM / runtime: Fix some typos cpufreq: qoriq: clean up unused code PM / OPP: fix off-by-one bug in dev_pm_opp_get_max_volt_latency loop PM / Domains: Power off masters immediately in the power off sequence PM / Domains: Rename is_async to one_dev_on for genpd_power_off() PM / Domains: Move genpd_power_off() above genpd_power_on()
2017-03-03sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>Ingo Molnar
It's not used by any of the scheduler methods, but <linux/sched/task_stack.h> needs it to pick up STACK_END_MAGIC. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>Ingo Molnar
This is a stray header that is not needed by anything in sched.h, so remove it. Update files that relied on the stray inclusion. This reduces the size of the header dependency graph. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03sched/headers: Move cputime functionality from <linux/sched.h> and ↵Ingo Molnar
<linux/cputime.h> into <linux/sched/cputime.h> Move cputime related functionality out of <linux/sched.h>, as most code that includes <linux/sched.h> does not use that functionality. Move data types that are not included in task_struct directly to the signal definitions, into <linux/sched/signal.h>. Also merge the (small) existing <linux/cputime.h> header into <linux/sched/cputime.h>. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03sched/headers: Move the task_lock()/unlock() APIs to <linux/sched/task.h>Ingo Molnar
The task_lock()/task_unlock() APIs are not realated to core scheduling, they are task lifetime APIs, i.e. they belong into <linux/sched/task.h>. Move them. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03sched/headers, RCU: Move rcu_copy_process() from <linux/sched/task.h> to ↵Ingo Molnar
kernel/fork.c Move rcu_copy_process() into kernel/fork.c, which is the only user of this inline function. This simplifies <linux/sched/task.h> to the level that <linux/sched.h> does not have to be included in it anymore - which change is done in a subsequent patch. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03sched/headers: Move task_struct::signal and task_struct::sighand types and ↵Ingo Molnar
accessors into <linux/sched/signal.h> task_struct::signal and task_struct::sighand are pointers, which would normally make it straightforward to not define those types in sched.h. That is not so, because the types are accompanied by a myriad of APIs (macros and inline functions) that dereference them. Split the types and the APIs out of sched.h and move them into a new header, <linux/sched/signal.h>. With this change sched.h does not know about 'struct signal' and 'struct sighand' anymore, trying to put accessors into sched.h as a test fails the following way: ./include/linux/sched.h: In function ‘test_signal_types’: ./include/linux/sched.h:2461:18: error: dereferencing pointer to incomplete type ‘struct signal_struct’ ^ This reduces the size and complexity of sched.h significantly. Update all headers and .c code that relied on getting the signal handling functionality from <linux/sched.h> to include <linux/sched/signal.h>. The list of affected files in the preparatory patch was partly generated by grepping for the APIs, and partly by doing coverage build testing, both all[yes|mod|def|no]config builds on 64-bit and 32-bit x86, and an array of cross-architecture builds. Nevertheless some (trivial) build breakage is still expected related to rare Kconfig combinations and in-flight patches to various kernel code, but most of it should be handled by this patch. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03Merge branches 'pm-cpuidle', 'pm-cpufreq' and 'pm-sleep'Rafael J. Wysocki
* pm-cpuidle: intel_idle: stop exposing platform acronyms in sysfs cpuidle: menu: Avoid taking spinlock for accessing QoS values * pm-cpufreq: cpufreq: intel_pstate: Fix limits issue with operation mode switching cpufreq: qoriq: clean up unused code * pm-sleep: PM / hibernate: Define pr_fmt() and use pr_*() instead of printk() PM / hibernate: Untangle power_down()
2017-03-02locking/ww_mutex: Adjust the lock number for stress testBoqun Feng
Because there are only 12 bits in held_lock::references, so we only support 4095 nested lock held in the same time, adjust the lock number for ww_mutex stress test to kill one lockdep splat: [ ] [ BUG: bad unlock balance detected! ] [ ] kworker/u2:0/5 is trying to release lock (ww_class_mutex) at: [ ] ww_mutex_unlock() [ ] but there are no more locks to release! ... Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicolai Hähnle <Nicolai.Haehnle@amd.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170301150138.hdixnmafzfsox7nn@tardis.cn.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02locking/lockdep: Add nest_lock integrity testPeter Zijlstra
Boqun reported that hlock->references can overflow. Add a debug test for that to generate a clear error when this happens. Without this, lockdep is likely to report a mysterious failure on unlock. Reported-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicolai Hähnle <Nicolai.Haehnle@amd.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02locking/ww_mutex: Replace cpu_relax() with cond_resched() for testsChris Wilson
When busy-spinning on a ww_mutex_trylock(), we depend upon the other thread advancing and releasing the lock. This can not happen on a single CPU unless we relinquish it: [ ] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:18] ... [ ] Call Trace: [ ] mutex_trylock() [ ] test_mutex_work+0x31/0x56 [ ] process_one_work+0x1b4/0x2f9 [ ] worker_thread+0x1b0/0x27c [ ] kthread+0xd1/0xd3 [ ] ret_from_fork+0x19/0x30 Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: f2a5fec17395 ("locking/ww_mutex: Begin kselftests for ww_mutex") Link: http://lkml.kernel.org/r/20170228094011.2595-1-chris@chris-wilson.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/core: Fix pick_next_task() for RT,DLPeter Zijlstra
Pavan noticed that the following commit: 49ee576809d8 ("sched/core: Optimize pick_next_task() for idle_sched_class") ... broke RT,DL balancing by robbing them of the opportinty to do new-'idle' balancing when their last runnable task (on that runqueue) goes away. Reported-by: Pavan Kondeti <pkondeti@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Fixes: 49ee576809d8 ("sched/core: Optimize pick_next_task() for idle_sched_class") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/fair: Make select_idle_cpu() more aggressivePeter Zijlstra
Kitsunyan reported desktop latency issues on his Celeron 887 because of commit: 1b568f0aabf2 ("sched/core: Optimize SCHED_SMT") ... even though his CPU doesn't do SMT. The effect of running the SMT code on a !SMT part is basically a more aggressive select_idle_cpu(). Removing the avg condition fixed things for him. I also know FB likes this test gone, even though other workloads like having it. For now, take it out by default, until we get a better idea. Reported-by: kitsunyan <kitsunyan@inbox.ru> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare to move the get_task_struct()/put_task_struct() and ↵Ingo Molnar
related APIs from <linux/sched.h> to <linux/sched/task.h> But first update usage sites with the new header dependency. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare to move _init() prototypes from <linux/sched.h> to ↵Ingo Molnar
<linux/sched/init.h> But first introduce a trivial header and update usage sites. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>