path: root/kernel
2016-12-12  prctl: remove one-shot limitation for changing exe link  (Stanislav Kinsburskiy)
This limitation came with the rationale of removing "another way for malicious code to obscure a compromised program and masquerade as a benign process", so that a "security-concious program can use this prctl once during its early initialization to ensure the prctl cannot later be abused for this purpose": http://marc.info/?l=linux-kernel&m=133160684517468&w=2

This explanation doesn't look sufficient. The only thing the "exe" link indicates is the file used for execve, which is basically nothing and not reliable immediately after the process has returned from the execve system call.

Moreover, to use this feature, all the mappings to the previous exe file have to be unmapped and all the new exe file permissions must be satisfied. That means changing the exe link is very similar to calling execve on the binary.

The need to remove this limitation comes from migration of an NFS mount point, which is not accessible during restore and is replaced by another file system. Because of this, the exe link has to be changed twice.

[akpm@linux-foundation.org: fix up comment]
Link: http://lkml.kernel.org/r/20160927153755.9337.69650.stgit@localhost.localdomain
Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
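For context, a minimal userspace sketch of changing the exe link via prctl(2). PR_SET_MM and PR_SET_MM_EXE_FILE are the real prctl constants (the fallback #defines cover older headers); the target path is illustrative, and the call still requires CAP_SYS_RESOURCE plus that all mappings of the old exe file are gone:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_MM
    #define PR_SET_MM          35
    #endif
    #ifndef PR_SET_MM_EXE_FILE
    #define PR_SET_MM_EXE_FILE 13
    #endif

    int main(void)
    {
        /* The file that /proc/self/exe should point to afterwards. */
        int fd = open("/tmp/new-exe", O_RDONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* With the one-shot limitation removed, this may now be
         * called more than once during the life of the process. */
        if (prctl(PR_SET_MM, PR_SET_MM_EXE_FILE, (unsigned long)fd, 0, 0))
            perror("prctl(PR_SET_MM_EXE_FILE)");

        close(fd);
        return 0;
    }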
2016-12-12  kthread: add __printf attributes  (Nicolas Iooss)
When commit fbae2d44aa1d ("kthread: add kthread_create_worker*()") introduced some kthread_create_...() functions which take printf-like parameters, it added __printf attributes to some of them (e.g. kthread_create_worker()). Nevertheless some new functions were forgotten (they have been detected thanks to the -Wmissing-format-attribute warning flag).

Add the missing __printf attributes to the newly-introduced functions in order to detect formatting issues at build time with the -Wformat flag.

Link: http://lkml.kernel.org/r/20161126193543.22672-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
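For illustration (not part of the patch), this is the mechanism the attribute provides; __printf() is the kernel's real wrapper macro, while my_printf is a hypothetical function:

    /* From <linux/compiler.h>:
     * #define __printf(a, b) __attribute__((format(printf, a, b))) */

    /* Argument 1 is the format string and the variadic arguments start
     * at position 2, so with -Wformat the compiler flags bad callers
     * such as my_printf("%s", 42) at build time. */
    __printf(1, 2)
    void my_printf(const char *fmt, ...);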
2016-12-12  Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull hotplug API fix from Ingo Molnar:
 "Late breaking fix from the v4.9 cycle: fix a hotplug register/unregister notifier API asymmetry bug that can cause kernel warnings (and worse) with certain Kconfig combinations"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hotplug: Make register and unregister notifier API symmetric
2016-12-12  Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler updates from Ingo Molnar:
 "The main scheduler changes in this cycle were:

  - support Intel Turbo Boost Max Technology 3.0 (TBM3) by introducing a notion of 'better cores', which the scheduler will prefer to schedule single threaded workloads on. (Tim Chen, Srinivas Pandruvada)

  - enhance the handling of asymmetric capacity CPUs further (Morten Rasmussen)

  - improve/fix load handling when moving tasks between task groups (Vincent Guittot)

  - simplify and clean up the cputime code (Stanislaw Gruszka)

  - improve mass fork()ed task spread a.k.a. hackbench speedup (Vincent Guittot)

  - make struct kthread kmalloc()ed and related fixes (Oleg Nesterov)

  - add uaccess atomicity debugging (when using access_ok() in the wrong context), under CONFIG_DEBUG_ATOMIC_SLEEP=y (Peter Zijlstra)

  - implement various fixes, cleanups and other enhancements (Daniel Bristot de Oliveira, Martin Schwidefsky, Rafael J. Wysocki)"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
  sched/core: Use load_avg for selecting idlest group
  sched/core: Fix find_idlest_group() for fork
  kthread: Don't abuse kthread_create_on_cpu() in __kthread_create_worker()
  kthread: Don't use to_live_kthread() in kthread_[un]park()
  kthread: Don't use to_live_kthread() in kthread_stop()
  Revert "kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function"
  kthread: Make struct kthread kmalloc'ed
  x86/uaccess, sched/preempt: Verify access_ok() context
  sched/x86: Make CONFIG_SCHED_MC_PRIO=y easier to enable
  sched/x86: Change CONFIG_SCHED_ITMT to CONFIG_SCHED_MC_PRIO
  x86/sched: Use #include <linux/mutex.h> instead of #include <asm/mutex.h>
  cpufreq/intel_pstate: Use CPPC to get max performance
  acpi/bus: Set _OSC for diverse core support
  acpi/bus: Enable HWP CPPC objects
  x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU
  x86/sysctl: Add sysctl for ITMT scheduling feature
  x86: Enable Intel Turbo Boost Max Technology 3.0
  x86/topology: Define x86's arch_update_cpu_topology
  sched: Extend scheduler's asym packing
  sched/fair: Clean up the tunable parameter definitions
  ...
2016-12-12  Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull locking updates from Ingo Molnar:
 "The tree got pretty big in this development cycle, but the net effect is pretty good:

    115 files changed, 673 insertions(+), 1522 deletions(-)

  The main changes were:

  - Rework and generalize the mutex code to remove per arch mutex primitives. (Peter Zijlstra)

  - Add vCPU preemption support: add an interface to query the preemption status of vCPUs and use it in locking primitives - this optimizes paravirt performance. (Pan Xinhui, Juergen Gross, Christian Borntraeger)

  - Introduce cpu_relax_yield() and remove cpu_relax_lowlatency() to clean up and improve the s390 lock yielding machinery and its core kernel impact. (Christian Borntraeger)

  - Micro-optimize mutexes some more. (Waiman Long)

  - Reluctantly add the to-be-deprecated mutex_trylock_recursive() interface on a temporary basis, to give the DRM code more time to get rid of its locking hacks. Any other users will be NAK-ed on sight. (We turned off the deprecation warning for the time being to not pollute the build log.) (Peter Zijlstra)

  - Improve the rtmutex code a bit, in light of recent long lived bugs/races. (Thomas Gleixner)

  - Misc fixes, cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  x86/paravirt: Fix bool return type for PVOP_CALL()
  x86/paravirt: Fix native_patch()
  locking/ww_mutex: Use relaxed atomics
  locking/rtmutex: Explain locking rules for rt_mutex_proxy_unlock()/init_proxy_locked()
  locking/rtmutex: Get rid of RT_MUTEX_OWNER_MASKALL
  x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()
  locking/mutex: Break out of expensive busy-loop on {mutex,rwsem}_spin_on_owner() when owner vCPU is preempted
  locking/osq: Break out of spin-wait busy waiting loop for a preempted vCPU in osq_lock()
  Documentation/virtual/kvm: Support the vCPU preemption check
  x86/xen: Support the vCPU preemption check
  x86/kvm: Support the vCPU preemption check
  x86/kvm: Support the vCPU preemption check
  kvm: Introduce kvm_write_guest_offset_cached()
  locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests
  locking/spinlocks, s390: Implement vcpu_is_preempted(cpu)
  locking/core, powerpc: Implement vcpu_is_preempted(cpu)
  sched/core: Introduce the vcpu_is_preempted(cpu) interface
  sched/wake_q: Rename WAKE_Q to DEFINE_WAKE_Q
  locking/core: Provide common cpu_relax_yield() definition
  locking/mutex: Don't mark mutex_trylock_recursive() as deprecated, temporarily
  ...
2016-12-12  Merge branch 'core-smp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull SMP bootup updates from Ingo Molnar:
 "Three changes to unify/standardize some of the bootup message printing in kernel/smp.c between architectures"

* 'core-smp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kernel/smp: Tell the user we're bringing up secondary CPUs
  kernel/smp: Make the SMP boot message common on all arches
  kernel/smp: Define pr_fmt() for smp.c
2016-12-12  Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull RCU updates from Ingo Molnar:
 "The main RCU changes in this development cycle were:

  - Miscellaneous fixes, including a change to call_rcu()'s rcu_head alignment check.

  - Security-motivated list consistency checks, which are disabled by default behind DEBUG_LIST.

  - Torture-test updates.

  - Documentation updates, yet again just simple changes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  torture: Prevent jitter from delaying build-only runs
  torture: Remove obsolete files from rcutorture .gitignore
  rcu: Don't kick unless grace period or request
  rcu: Make expedited grace periods recheck dyntick idle state
  torture: Trace long read-side delays
  rcu: RCU_TRACE enables event tracing as well as debugfs
  rcu: Remove obsolete comment from __call_rcu()
  rcu: Remove obsolete rcu_check_callbacks() header comment
  rcu: Tighten up __call_rcu() rcu_head alignment check
  Documentation/RCU: Fix minor typo
  documentation: Present updated RCU guarantee
  bug: Avoid Kconfig warning for BUG_ON_DATA_CORRUPTION
  lib/Kconfig.debug: Fix typo in select statement
  lkdtm: Add tests for struct list corruption
  bug: Provide toggle for BUG on data corruption
  list: Split list_del() debug checking into separate function
  rculist: Consolidate DEBUG_LIST for list_add_rcu()
  list: Split list_add() debug checking into separate function
2016-12-11  sched/core: Use load_avg for selecting idlest group  (Vincent Guittot)
find_idlest_group() only compares the runnable_load_avg when looking for the least loaded group. But on a fork-intensive use case like hackbench, where tasks block quickly after the fork, this can lead to selecting the same CPU instead of other CPUs which have similar runnable load but a lower load_avg.

When the runnable_load_avg of 2 CPUs are close, we now take into account the amount of blocked load as a 2nd selection factor. There are now 3 zones for the runnable_load of the rq:

- [0 .. (runnable_load - imbalance)]:
  Select the new rq which has significantly less runnable_load

- [(runnable_load - imbalance) .. (runnable_load + imbalance)]:
  The runnable loads are close, so we use load_avg to choose between the 2 rq

- [(runnable_load + imbalance) .. ULONG_MAX]:
  Keep the current rq which has significantly less runnable_load

The scale factor that is currently used for comparing runnable_load doesn't work well with small values. As an example, the use of a scaling factor fails as soon as this_runnable_load == 0, because we always select the local rq even if min_runnable_load is only 1, which doesn't really make sense because they are just the same. So instead of a scaling factor, we use an absolute margin for runnable_load to detect CPUs with similar runnable_load, and we keep using the scaling factor for blocked load.

For use cases like hackbench, this enables the scheduler to select different CPUs during the fork sequence and to spread tasks across the system.

Tests have been done on a Hikey board (ARM based octo cores) for several kernels. The result below gives min, max, avg and stdev values of 18 runs with each configuration. The patches depend on the "no missing update_rq_clock()" work.

hackbench -P -g 1

         ea86cb4b7621  7dc603c9028e  v4.8       v4.8+patches
  min    0.049         0.050         0.051      0.048
  avg    0.057         0.057(0%)     0.057(0%)  0.055(+5%)
  max    0.066         0.068         0.070      0.063
  stdev  +/-9%         +/-9%         +/-8%      +/-9%

More performance numbers here: https://lkml.kernel.org/r/20161203214707.GI20785@codeblueprint.co.uk

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com
Cc: umgwanakikbuti@gmail.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1481216215-24651-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
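A condensed sketch of the three-zone rule described above, as it would sit at the tail of find_idlest_group() in kernel/sched/fair.c; variable names approximate the patched code and the surrounding scan loop is omitted:

    /* imbalance is an absolute margin on runnable_load; imbalance_scale
     * is the relative factor kept for the blocked load (load_avg). */
    if (min_runnable_load > (this_runnable_load + imbalance))
            return NULL;    /* zone 1: local rq clearly less loaded, keep it */

    if ((this_runnable_load < (min_runnable_load + imbalance)) &&
        (100 * this_avg_load < imbalance_scale * min_avg_load))
            return NULL;    /* zone 2: runnable loads close, local wins on load_avg */

    return idlest;          /* zone 3: remote group clearly less loaded */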
2016-12-11  sched/core: Fix find_idlest_group() for fork  (Vincent Guittot)
During fork, the utilization of a task is initialized once the rq has been selected, because the current utilization level of the rq is used to set the utilization of the forked task. As the task's utilization is still 0 at this step of the fork sequence, it doesn't make sense to look for some spare capacity that can fit the task's utilization. Furthermore, I can see performance regressions for the test "hackbench -P -g 1" because the least-loaded policy is always bypassed and tasks are not spread during fork.

With this patch and the fix below, we are back to the same performance as for v4.8. The fix below is only a temporary one used for the test until a smarter solution is found, because we can't simply remove the test, which is useful for other benchmarks:

| @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
|
|	avg_cost = this_sd->avg_scan_cost;
|
| -	/*
| -	 * Due to large variance we need a large fuzz factor; hackbench in
| -	 * particularly is sensitive here.
| -	 */
| -	if ((avg_idle / 512) < avg_cost)
| -		return -1;
| -
|	time = local_clock();
|
|	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com
Cc: umgwanakikbuti@gmail.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1481216215-24651-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-11  Merge branch 'linus' into sched/core, to pick up fixes  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-11  Merge branch 'linus' into locking/core, to pick up fixes  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-10  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
2016-12-08  bpf: xdp: Allow head adjustment in XDP prog  (Martin KaFai Lau)
This patch allows an XDP prog to extend/remove the packet data at the head (like adding or removing a header). It is done by adding a new XDP helper bpf_xdp_adjust_head().

It also renames bpf_helper_changes_skb_data() to bpf_helper_changes_pkt_data() to better reflect that an XDP prog does not work on skb.

This patch adds one "xdp_adjust_head" bit to bpf_prog for the XDP-capable driver to check if the XDP prog requires bpf_xdp_adjust_head() support. The driver can then decide to error out during XDP_SETUP_PROG.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
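A minimal sketch of an XDP program using the new helper to strip a 4-byte header from the front of the packet; bpf_xdp_adjust_head() is the helper this patch adds, while the SEC() macro, the helper-pointer idiom, and the 4-byte length follow the kernel samples' conventions and are illustrative:

    #include <linux/bpf.h>

    #define SEC(name) __attribute__((section(name), used))

    static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
            (void *) BPF_FUNC_xdp_adjust_head;

    SEC("xdp")
    int xdp_strip4(struct xdp_md *ctx)
    {
            /* A positive offset moves ctx->data forward, removing bytes
             * from the head; a negative offset would grow headroom for
             * prepending a header instead. */
            if (bpf_xdp_adjust_head(ctx, 4))
                    return XDP_DROP;

            void *data = (void *)(long)ctx->data;
            void *data_end = (void *)(long)ctx->data_end;

            /* Pointers must be revalidated after any head adjustment. */
            if (data + 14 > data_end)
                    return XDP_DROP;

            return XDP_PASS;
    }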
2016-12-08  bpf: fix state equivalence  (Alexei Starovoitov)
Commits 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") and 484611357c19 ("bpf: allow access into map value arrays") are correct by themselves, but in combination they make state equivalence ignore the 'id' field of the register state, which can lead to accepting an invalid program.

Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
Fixes: 484611357c19 ("bpf: allow access into map value arrays")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08  kthread: Don't abuse kthread_create_on_cpu() in __kthread_create_worker()  (Oleg Nesterov)
kthread_create_on_cpu() sets KTHREAD_IS_PER_CPU and kthread->cpu; this only makes sense if this kthread can be parked/unparked by cpuhp code. kthread workers never call kthread_parkme(), so this has no effect.

Change __kthread_create_worker() to simply call kthread_bind(task, cpu). The very fact that kthread_create_on_cpu() doesn't accept a generic fmt shows that it should not be used outside of smpboot.c.

Now, the only reason we can not unexport this helper and move it into smpboot.c is that it sets kthread->cpu and struct kthread is not exported. And the only reason we can not kill kthread->cpu is that kthread_unpark() is used by drivers/gpu/drm/amd/scheduler/gpu_scheduler.c, and thus we can not turn _unpark into kthread_unpark(struct smp_hotplug_thread *, cpu).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Chunming Zhou <David1.Zhou@amd.com>
Cc: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20161129175110.GA5342@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-08  kthread: Don't use to_live_kthread() in kthread_[un]park()  (Oleg Nesterov)
Now that to_kthread() is always valid, change kthread_park() and kthread_unpark() to use it and kill to_live_kthread().

The conversion of kthread_unpark() is trivial. If KTHREAD_IS_PARKED is set then the task has called complete(&self->parked) and thus the function cannot race against a concurrent kthread_stop() and exit.

kthread_park() is more tricky, because its semantics are not well defined. It returns -ENOSYS if the thread exited, but this can never happen and, as Roman pointed out, kthread_park() can obviously block forever if it were to race with the exiting kthread.

The usage of kthread_park() in cpuhp code (cpu.c, smpboot.c, stop_machine.c) is fine. It can never see an exiting/exited kthread: smpboot_destroy_threads() clears *ht->store, and smpboot_park_thread() checks it is not NULL under the same smpboot_threads_lock. cpuhp_threads and cpu_stop_threads never exit, so other callers are fine too.

But it has two more users:

- watchdog_park_threads(): The code is actually correct, get_online_cpus() ensures that kthread_park() can't race with itself (note that kthread_park() can't handle this race correctly), but it should not use kthread_park() directly.

- drivers/gpu/drm/amd/scheduler/gpu_scheduler.c should not use kthread_park() either. kthread_park() must not be called after amd_sched_fini(), which does kthread_stop(); otherwise even to_live_kthread() is not safe because the task_struct can be already freed and sched->thread can point to nowhere.

The usage of kthread_park/unpark should either be restricted to core code which is properly protected against the exit race, or made more robust so it is safe to use it in drivers. To catch eventual exit issues, add a WARN_ON(PF_EXITING) for now.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chunming Zhou <David1.Zhou@amd.com>
Cc: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20161129175107.GA5339@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-08  kthread: Don't use to_live_kthread() in kthread_stop()  (Oleg Nesterov)
kthread_stop() had to use to_live_kthread() simply because it was not possible to access kthread->exited after the exiting task clears task_struct->vfork_done.

Now that to_kthread() is always valid, wake_up_process() + wait_for_completion() can be done unconditionally. It's not an issue anymore if the task has already issued complete_vfork_done() or died. The exiting task can get a spurious wakeup after mm_release(), but this is possible without this change too and is fine; do_task_dead() ensures that this can't do any harm.

As a further enhancement this could be converted to task_work_add() later, so ->vfork_done can be avoided completely.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chunming Zhou <David1.Zhou@amd.com>
Cc: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20161129175103.GA5336@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-08Revert "kthread: Pin the stack via try_get_task_stack()/put_task_stack() in ↵Oleg Nesterov
to_live_kthread() function" This reverts commit 23196f2e5f5d810578a772785807dcdc2b9fdce9. Now that struct kthread is kmalloc'ed and not longer on the task stack there is no need anymore to pin the stack. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Chunming Zhou <David1.Zhou@amd.com> Cc: Roman Pen <roman.penyaev@profitbricks.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Tejun Heo <tj@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20161129175100.GA5333@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-08  kthread: Make struct kthread kmalloc'ed  (Oleg Nesterov)
commit 23196f2e5f5d "kthread: Pin the stack via try_get_task_stack() / put_task_stack() in to_live_kthread() function" is a workaround for the fragile design of struct kthread being allocated on the task stack.

struct kthread in its current form should be removed, but this needs cleanups outside of kthread.c.

As a first step, move struct kthread away from the task stack by making it kmalloc'ed. This allows accessing kthread.exited without the magic of trying to pin the task stack and without the try logic in to_live_kthread().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chunming Zhou <David1.Zhou@amd.com>
Cc: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20161129175057.GA5330@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-08  hotplug: Make register and unregister notifier API symmetric  (Michal Hocko)
Yu Zhao has noticed that __unregister_cpu_notifier only unregisters its notifiers when HOTPLUG_CPU=y, while the registration might succeed even when HOTPLUG_CPU=n if MODULE is enabled. This means that e.g. zswap might keep a stale notifier on the list on the manual clean-up during the pool tear-down and thus corrupt the list, resulting in the following:

[  144.964346] BUG: unable to handle kernel paging request at ffff880658a2be78
[  144.971337] IP: [<ffffffffa290b00b>] raw_notifier_chain_register+0x1b/0x40
<snipped>
[  145.122628] Call Trace:
[  145.125086]  [<ffffffffa28e5cf8>] __register_cpu_notifier+0x18/0x20
[  145.131350]  [<ffffffffa2a5dd73>] zswap_pool_create+0x273/0x400
[  145.137268]  [<ffffffffa2a5e0fc>] __zswap_param_set+0x1fc/0x300
[  145.143188]  [<ffffffffa2944c1d>] ? trace_hardirqs_on+0xd/0x10
[  145.149018]  [<ffffffffa2908798>] ? kernel_param_lock+0x28/0x30
[  145.154940]  [<ffffffffa2a3e8cf>] ? __might_fault+0x4f/0xa0
[  145.160511]  [<ffffffffa2a5e237>] zswap_compressor_param_set+0x17/0x20
[  145.167035]  [<ffffffffa2908d3c>] param_attr_store+0x5c/0xb0
[  145.172694]  [<ffffffffa290848d>] module_attr_store+0x1d/0x30
[  145.178443]  [<ffffffffa2b2b41f>] sysfs_kf_write+0x4f/0x70
[  145.183925]  [<ffffffffa2b2a5b9>] kernfs_fop_write+0x149/0x180
[  145.189761]  [<ffffffffa2a99248>] __vfs_write+0x18/0x40
[  145.194982]  [<ffffffffa2a9a412>] vfs_write+0xb2/0x1a0
[  145.200122]  [<ffffffffa2a9a732>] SyS_write+0x52/0xa0
[  145.205177]  [<ffffffffa2ff4d97>] entry_SYSCALL_64_fastpath+0x12/0x17

This can even be triggered manually by changing /sys/module/zswap/parameters/compressor multiple times.

Fix this issue by making the unregister APIs symmetric to the register ones, so there are no surprises.

Fixes: 47e627bc8c9a ("[PATCH] hotplug: Allow modules to use the cpu hotplug notifiers even if !CONFIG_HOTPLUG_CPU")
Reported-and-tested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Link: http://lkml.kernel.org/r/20161207135438.4310-1-mhocko@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
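The shape of the fix, sketched; this is illustrative only (the actual patch rearranges the #ifdef blocks around the cpu notifier code so both directions are compiled under the same conditions, and the exact guard shown here is an assumption):

    /* The bug: registration was compiled in under one config
     * combination, unregistration under another, so a notifier
     * registered with HOTPLUG_CPU=n, MODULE=y could never be removed
     * again. The fix keeps both declarations under one common guard: */
    #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MODULE)  /* illustrative guard */
    int __register_cpu_notifier(struct notifier_block *nb);
    void __unregister_cpu_notifier(struct notifier_block *nb);
    #endif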
2016-12-07  kcov: add missing #include <linux/sched.h>  (Kefeng Wang)
In __sanitizer_cov_trace_pc we use task_struct and fields within it, but as we haven't included <linux/sched.h>, it is not guaranteed to be defined. While we usually happen to acquire the definition through a transitive include, this is fragile (and hasn't been true in the past, causing issues with backports).

Include <linux/sched.h> to avoid any fragility.

[mark.rutland@arm.com: rewrote changelog]
Link: http://lkml.kernel.org/r/1481007384-27529-1-git-send-email-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-07  Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler fix from Ingo Molnar:
 "An autogroup nice level adjustment bug fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/autogroup: Fix 64-bit kernel nice level adjustment
2016-12-07  Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull perf fixes from Ingo Molnar:
 "A bogus warning fix, a counter width handling fix affecting certain machines, plus a one-liner hw-enablement patch for Knights Mill CPUs"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Remove invalid warning from list_update_cgroup_event()
  perf/x86: Fix full width counter, counter overflow
  perf/x86/intel: Enable C-state residency events for Knights Mill
2016-12-07  Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull locking fixes from Ingo Molnar:
 "Two rtmutex race fixes (which miraculously never triggered, that we know of), plus two lockdep printk formatting regression fixes"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Fix report formatting
  locking/rtmutex: Use READ_ONCE() in rt_mutex_owner()
  locking/rtmutex: Prevent dequeue vs. unlock race
  locking/selftest: Fix output since KERN_CONT changes
2016-12-07  bpf: fix loading of BPF_MAXINSNS sized programs  (Daniel Borkmann)
The general assumption is that a single program can hold up to BPF_MAXINSNS, that is, 4096 instructions. It is the case with cBPF and that limit was carried over to eBPF. When recently testing digests, I noticed that it's actually not possible to feed 4096 instructions via bpf(2).

The check for > BPF_MAXINSNS was added back then to bpf_check() in cbd357008604 ("bpf: verifier (add ability to receive verification log)"). However, 09756af46893 ("bpf: expand BPF syscall with program load/unload") added yet another check that comes before that into bpf_prog_load(), but this time it bails out already in case of >= BPF_MAXINSNS.

Fix it up and perform the check early in bpf_prog_load(), so we can drop the second one in bpf_check(). It makes sense, because a 0-insn program is useless as well, and we don't want to waste any resources doing work up to the bpf_check() point. The existing bpf(2) man page documents E2BIG as the official error for such cases, so just stick with it.

Fixes: 09756af46893 ("bpf: expand BPF syscall with program load/unload")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
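The resulting early check in bpf_prog_load(), roughly (a sketch; E2BIG matches the bpf(2) man page as noted above):

    /* Reject empty programs and programs over the limit, but accept
     * exactly BPF_MAXINSNS (4096) instructions; the old check used >=
     * and wrongly rejected the boundary case. */
    if (attr->insn_cnt == 0 || attr->insn_cnt > BPF_MAXINSNS)
            return -E2BIG;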
2016-12-07  clocksource: export the clocks_calc_mult_shift to use by timestamp code  (Murali Karicheri)
The CPSW CPTS driver is capable of doing timestamping on tx/rx packets and requires knowing the mult and shift factors for timestamp conversion from raw value to nanoseconds (ptp clock). Currently these mult and shift factors are calculated manually and provided through DT, which makes it very hard to support a large number of platforms, especially if the CPTS refclk is not the same for some kinds of boards and depends on efuse settings (Keystone 2 platforms).

Hence, export clocks_calc_mult_shift() to allow drivers like CPSW CPTS (and other ptp drivers) to benefit from automatic calculation of the mult and shift factors.

Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
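A sketch of how a PTP driver can use the now-exported helper; clocks_calc_mult_shift() and its signature are real, while the 500 MHz refclk, the 4-second conversion window, and the my_cpts_* names are illustrative:

    #include <linux/clocksource.h>

    static u32 cc_mult, cc_shift;

    static void my_cpts_init_time(void)  /* hypothetical driver init */
    {
            /* Derive mult/shift so that ns = (cycles * mult) >> shift
             * holds for a 500 MHz refclk, accurate over a ~4 s window,
             * instead of hardcoding the factors in DT. */
            clocks_calc_mult_shift(&cc_mult, &cc_shift,
                                   500000000, NSEC_PER_SEC, 4);
    }

    static inline u64 my_cpts_cyc2ns(u64 cycles)
    {
            return (cycles * cc_mult) >> cc_shift;
    }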
2016-12-06  lockdep: Fix report formatting  (Dmitry Vyukov)
Since commit:

  4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation lines")

printk() requires KERN_CONT to continue log messages. Lots of printk() calls in lockdep.c and print_ip_sym() don't have it. As a result, lockdep reports are completely messed up. Add the missing KERN_CONT and inline print_ip_sym() where necessary.

Example of a messed up report:

  0-rc5+ #41 Not tainted
  -------------------------------------------------------
  syz-executor0/5036 is trying to acquire lock:
  (
   rtnl_mutex
  ){+.+.+.}
  , at:
  [<ffffffff86b3d6ac>] rtnl_lock+0x1c/0x20
  but task is already holding lock:
  (
   &net->packet.sklist_lock
  ){+.+...}
  , at:
  [<ffffffff873541a6>] packet_diag_dump+0x1a6/0x1920
  which lock already depends on the new lock.
  the existing dependency chain (in reverse order) is:
  -> #3
  (
   &net->packet.sklist_lock
  +.+...}
  ...

Without this patch all scripts that parse kernel bug reports are broken.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: andreyknvl@google.com
Cc: aryabinin@virtuozzo.com
Cc: joe@perches.com
Cc: syzkaller@googlegroups.com
Link: http://lkml.kernel.org/r/1480343083-48731-1-git-send-email-dvyukov@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
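The rule being applied, in miniature; printk() and KERN_CONT are the real interfaces, the surrounding function and message text are made up:

    static void print_holder(struct task_struct *p)
    {
            /* Since 4bcc595ccd80, a printk() without KERN_CONT starts
             * a new line; pieces that continue the current line must
             * say so explicitly (pr_cont() is the usual shorthand). */
            printk(KERN_INFO "lock held by ");
            printk(KERN_CONT "%s/%d\n", p->comm, p->pid);
    }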
2016-12-06  perf/core: Remove invalid warning from list_update_cgroup_event()  (David Carrillo-Cisneros)
The warning introduced in commit:

  864c2357ca89 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")

assumed that a cgroup switch always precedes list_del_event. This is not the case. Remove the warning.

Make sure that cpuctx->cgrp is NULL until a cgroup event is scheduled in, or until ctx->nr_cgroups == 0.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi V Shankar <ravi.v.shankar@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1480841177-27299-1-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-05  bpf: add prog_digest and expose it via fdinfo/netlink  (Daniel Borkmann)
When loading a BPF program via bpf(2), calculate the digest over the program's instruction stream and store it in struct bpf_prog's digest member. This is done at a point in time before any instructions are rewritten by the verifier. Any unstable map file descriptor number that is part of the imm field will be zeroed for the hash.

fdinfo example output for progs:

  # cat /proc/1590/fdinfo/5
  pos:          0
  flags:        02000002
  mnt_id:       11
  prog_type:    1
  prog_jited:   1
  prog_digest:  b27e8b06da22707513aa97363dfb11c7c3675d28
  memlock:      4096

When programs are pinned and retrieved by an ELF loader, the loader can check the program's digest through fdinfo and compare it against one that was generated over the ELF file's program section to see if the program needs to be reloaded. Furthermore, this can also be exposed through other means such as netlink in case of a tc cls/act dump (or xdp in future), but also through tracepoints or other facilities to identify the program.

Other than that, the digest can also serve as a base name for the work-in-progress kallsyms support of programs. The digest doesn't depend on or select the crypto layer, since we need to keep dependencies to a minimum. iproute2 will get support for this facility.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05  bpf: Preserve const register type on const OR alu ops  (Gianluca Borello)
Occasionally, clang (e.g. version 3.8.1) translates a sum between two constant operands using a BPF_OR instead of a BPF_ADD. The verifier is currently not handling this scenario, and the destination register type becomes UNKNOWN_VALUE even if it's still storing a constant. As a result, the destination register cannot be used as an argument to a helper function expecting an ARG_CONST_STACK_*, limiting some use cases.

Modify the verifier to handle this case, and add a few tests to make sure all combinations are supported, and that stack boundaries are still verified even with BPF_OR.

Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
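A condensed sketch of the verifier change; illustrative only, modeled on the CONST_IMM handling inside the ALU-op checking of kernel/bpf/verifier.c of this era:

    /* When both operands hold known constants, a BPF_OR emitted by
     * clang in place of an add still yields a known constant, so keep
     * the CONST_IMM type instead of degrading to UNKNOWN_VALUE. */
    if (opcode == BPF_OR &&
        dst_reg->type == CONST_IMM && src_reg->type == CONST_IMM) {
            dst_reg->imm |= src_reg->imm;
            return 0;
    }
    /* ...otherwise fall through to the existing generic ALU handling. */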
2016-12-03  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
A couple of conflicts resolved here:

1) In the MACB driver, a bug fix to properly initialize the RX tail pointer overlapped with some changes to support variable sized rings.

2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix overlapping with a reorganization of the driver to support ACPI, OF, as well as PCI variants of the chip.

3) In 'net' we had several probe error path bug fixes to the stmmac driver, meanwhile a lot of this code was cleaned up and reorganized in 'net-next'.

4) The cls_flower classifier obtained a helper function in 'net-next' called __fl_delete() and this overlapped with Daniel Borkmann's bug fix to use RCU for object destruction in 'net'. It also overlapped with Jiri's change to guard the rhashtable_remove_fast() call with a check against tc_skip_sw().

5) In mlx4, a revert bug fix in 'net' overlapped with some unrelated changes in 'net-next'.

6) In geneve, a stale header pointer after pskb_expand_head() bug fix in 'net' overlapped with a large reorganization of the same code in 'net-next'. Since the 'net-next' code no longer had the bug in question, there was nothing to do other than to simply take the 'net-next' hunks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (Linus Torvalds)
Pull networking fixes from David Miller:

 1) Lots more phydev and probe error path leaks in various drivers by Johan Hovold.

 2) Fix race in packet_set_ring(), from Philip Pettersson.

 3) Use after free in dccp_invalid_packet(), from Eric Dumazet.

 4) Signedness overflow in SO_{SND,RCV}BUFFORCE, also from Eric Dumazet.

 5) When tunneling between ipv4 and ipv6 we can be left with the wrong skb->protocol value as we enter the IPSEC engine and this causes all kinds of problems. Set it before the output path does any dst_output() calls, from Eli Cooper.

 6) bcmgenet uses wrong device struct pointer in DMA API calls, fix from Florian Fainelli.

 7) Various netfilter nat bug fixes from Florian Westphal.

 8) Fix memory leak in ipvlan_link_new(), from Gao Feng.

 9) Locking fixes, particularly wrt. socket lookups, in l2tp from Guillaume Nault.

10) Avoid invoking rhash teardowns in atomic context by moving netlink cb->done() dump completion from a worker thread. Fix from Herbert Xu.

11) Buffer refcount problems in tun and macvtap on errors, from Jason Wang.

12) We don't set Kconfig symbol DEFAULT_TCP_CONG properly when the user selects BBR. Fix from Julian Wollrath.

13) Fix deadlock in transmit path on altera TSE driver, from Lino Sanfilippo.

14) Fix unbalanced reference counting in dsa_switch_tree, from Nikita Yushchenko.

15) tc_tunnel_key needs to be properly exported to userspace via uapi, fix from Roi Dayan.

16) rds_tcp_init_net() doesn't unregister notifier in error path, fix from Sowmini Varadhan.

17) Stale packet header pointer access after pskb_expand_head() in geneve driver, fix from Sabrina Dubroca.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (103 commits)
  net: avoid signed overflows for SO_{SND|RCV}BUFFORCE
  geneve: avoid use-after-free of skb->data
  tipc: check minimum bearer MTU
  net: renesas: ravb: uninitialized return value
  sh_eth: remove unchecked interrupts for RZ/A1
  net: bcmgenet: Utilize correct struct device for all DMA operations
  NET: usb: qmi_wwan: add support for Telit LE922A PID 0x1040
  cdc_ether: Fix handling connection notification
  ip6_offload: check segs for NULL in ipv6_gso_segment.
  RDS: TCP: unregister_netdevice_notifier() in error path of rds_tcp_init_net
  Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"
  ipv6: Set skb->protocol properly for local output
  ipv4: Set skb->protocol properly for local output
  packet: fix race condition in packet_set_ring
  net: ethernet: altera: TSE: do not use tx queue lock in tx completion handler
  net: ethernet: altera: TSE: Remove unneeded dma sync for tx buffers
  net: ethernet: stmmac: fix of-node and fixed-link-phydev leaks
  net: ethernet: stmmac: platform: fix outdated function header
  net: ethernet: stmmac: dwmac-meson8b: fix probe error path
  net: ethernet: stmmac: dwmac-generic: fix probe error path
  ...
2016-12-02  bpf: Add new cgroup attach type to enable sock modifications  (David Ahern)
Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to BPF_PROG_TYPE_CGROUP_SKB, programs can be attached to a cgroup and run any time a process in the cgroup opens an AF_INET or AF_INET6 socket. Currently only sk_bound_dev_if is exported to userspace for modification by a bpf program.

This allows a cgroup to be configured such that AF_INET{6} sockets opened by processes are automatically bound to a specific device. In turn, this enables the running of programs that do not support SO_BINDTODEVICE in a specific VRF context / L3 domain.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
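A minimal sketch of such a program; struct bpf_sock and its bound_dev_if member come from this patch, while the SEC() macro, the section name, the ifindex value 4, and the "return 1 = allow" convention are assumptions drawn from the cgroup-BPF conventions of the time:

    #include <linux/bpf.h>

    #define SEC(name) __attribute__((section(name), used))

    SEC("cgroup/sock")
    int bind_to_dev(struct bpf_sock *sk)
    {
            /* Bind every AF_INET{6} socket opened in this cgroup to
             * ifindex 4 (e.g. a VRF device), as if the process had
             * used SO_BINDTODEVICE itself. */
            sk->bound_dev_if = 4;
            return 1;   /* allow the socket */
    }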
2016-12-02  bpf: Refactor cgroups code in prep for new type  (David Ahern)
Code move and rename only; no functional change intended.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02  bpf: BPF for lightweight tunnel infrastructure  (Thomas Graf)
Registers new BPF program types which correspond to the LWT hooks:

  - BPF_PROG_TYPE_LWT_IN   => dst_input()
  - BPF_PROG_TYPE_LWT_OUT  => dst_output()
  - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()

The separate program types are required to differentiate between the capabilities each LWT hook allows:

 * Programs attached to dst_input() or dst_output() are restricted and may only read the data of an skb. This prevents modification and possible invalidation of already validated packet headers on receive and the construction of illegal headers while the IP headers are still being assembled.

 * Programs attached to lwtunnel_xmit() are allowed to modify packet content as well as to prepend an L2 header via a newly introduced helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is invoked after the IP header has been assembled completely.

All BPF programs receive an skb with L3 headers attached and may return one of the following return codes:

  BPF_OK       - Continue routing as per nexthop
  BPF_DROP     - Drop skb and return EPERM
  BPF_REDIRECT - Redirect skb to device as per redirect() helper. (Only valid in lwtunnel_xmit() context)

The return codes are binary compatible with their TC_ACT_ relatives to ease compatibility.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
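A sketch of a minimal LWT program; BPF_OK/BPF_DROP are the return codes this patch defines, while the SEC() macro, section name, loading hint, and the drop condition are illustrative:

    #include <linux/bpf.h>

    #define SEC(name) __attribute__((section(name), used))

    /* Loaded via iproute2 along the lines of:
     *   ip route add ... encap bpf in obj prog.o section lwt_in */
    SEC("lwt_in")
    int lwt_in_prog(struct __sk_buff *skb)
    {
            /* dst_input() programs may only read skb data; drop runts,
             * let everything else continue along the nexthop. */
            if (skb->len < 20)
                    return BPF_DROP;
            return BPF_OK;
    }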
2016-12-02  locking/rtmutex: Explain locking rules for rt_mutex_proxy_unlock()/init_proxy_locked()  (Thomas Gleixner)
While debugging the unlock vs. dequeue race which resulted in state corruption of futexes, the lockless nature of rt_mutex_proxy_unlock() caused some confusion.

Add commentary to explain why it is safe to do this locklessly. Add matching comments to rt_mutex_init_proxy_locked() for completeness' sake.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Daney <ddaney@caviumnetworks.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20161130210030.591941927@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-02  locking/rtmutex: Get rid of RT_MUTEX_OWNER_MASKALL  (Thomas Gleixner)
This is a leftover from the original rtmutex implementation which used both bit0 and bit1 in the owner pointer. Commit:

  8161239a8bcc ("rtmutex: Simplify PI algorithm and make highest prio task get lock")

... removed the usage of bit1, but kept the extra mask around. This is confusing at best.

Remove it and just use RT_MUTEX_HAS_WAITERS for the masking.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Daney <ddaney@caviumnetworks.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20161130210030.509567906@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-02  Merge branch 'locking/urgent' into locking/core, to pick up dependent fixes  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-02  locking/rtmutex: Use READ_ONCE() in rt_mutex_owner()  (Thomas Gleixner)
While debugging the rtmutex unlock vs. dequeue race, Will suggested using READ_ONCE() in rt_mutex_owner() as it might race against the cmpxchg_release() in unlock_rt_mutex_safe().

Will: "It's a minor thing which will most likely not matter in practice"

Careful search did not unearth an actual problem in today's code, but it's better to be safe than surprised.

Suggested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Daney <ddaney@caviumnetworks.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20161130210030.431379999@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
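The resulting accessor, roughly; a sketch of rt_mutex_common.h after this change (and after the RT_MUTEX_OWNER_MASKALL removal above), with RT_MUTEX_HAS_WAITERS being the real waiters bit:

    static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
    {
            unsigned long owner = (unsigned long) READ_ONCE(lock->owner);

            /* Mask off the waiters bit stored in the low bits of the
             * owner pointer. */
            return (struct task_struct *) (owner & ~RT_MUTEX_HAS_WAITERS);
    }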
2016-12-02  locking/rtmutex: Prevent dequeue vs. unlock race  (Thomas Gleixner)
David reported a futex/rtmutex state corruption. It's caused by the following problem:

  CPU0              CPU1                    CPU2

  l->owner=T1
                    rt_mutex_lock(l)
                    lock(l->wait_lock)
                    l->owner = T1 | HAS_WAITERS;
                    enqueue(T2)
                    boost()
                    unlock(l->wait_lock)
                    schedule()

                                            rt_mutex_lock(l)
                                            lock(l->wait_lock)
                                            l->owner = T1 | HAS_WAITERS;
                                            enqueue(T3)
                                            boost()
                                            unlock(l->wait_lock)
                                            schedule()

  signal(->T2)      signal(->T3)
                    lock(l->wait_lock)
                    dequeue(T2)
                    deboost()
                    unlock(l->wait_lock)

                                            lock(l->wait_lock)
                                            dequeue(T3)
                                             ===> wait list is now empty
                                            deboost()
                                            unlock(l->wait_lock)

                    lock(l->wait_lock)
                    fixup_rt_mutex_waiters()
                      if (wait_list_empty(l)) {
                        owner = l->owner & ~HAS_WAITERS;
                        l->owner = owner
                         ==> l->owner = T1
                      }

                                            lock(l->wait_lock)
  rt_mutex_unlock(l)                        fixup_rt_mutex_waiters()
                                              if (wait_list_empty(l)) {
                                                owner = l->owner & ~HAS_WAITERS;
  cmpxchg(l->owner, T1, NULL)
   ===> Success (l->owner = NULL)
                                                l->owner = owner
                                                 ==> l->owner = T1
                                              }

That means the problem is caused by fixup_rt_mutex_waiters(), which does the RMW to clear the waiters bit unconditionally when there are no waiters in the rtmutex's rbtree.

This can be fatal: A concurrent unlock can release the rtmutex in the fastpath because the waiters bit is not set. If the cmpxchg() gets in the middle of the RMW operation, then the previous owner, which just unlocked the rtmutex, is set as the owner again when the write takes place after the successful cmpxchg().

The solution is rather trivial: verify that the owner member of the rtmutex has the waiters bit set before clearing it. This does not require a cmpxchg() or other atomic operations, because the waiters bit can only be set and cleared with the rtmutex wait_lock held. It's also safe against the fast path unlock attempt. The unlock attempt via cmpxchg() will either see the bit set and take the slowpath, or see the bit cleared and release it atomically in the fastpath.

It's remarkable that the test program provided by David triggers on ARM64 and MIPS64 really quickly, but it refuses to reproduce on x86-64, while the problem exists there as well. That refusal might explain why this was not discovered earlier despite the bug existing from day one of the rtmutex implementation more than 10 years ago.

Thanks to David for meticulously instrumenting the code and providing the information which allowed to decode this subtle problem.

Reported-by: David Daney <ddaney@caviumnetworks.com>
Tested-by: David Daney <david.daney@cavium.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: stable@vger.kernel.org
Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core")
Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
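The trivial fix described above, sketched; the names are real rtmutex internals and the shape follows the post-fix fixup_rt_mutex_waiters():

    static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
    {
            unsigned long owner, *p = (unsigned long *) &lock->owner;

            if (rt_mutex_has_waiters(lock))
                    return;

            /* Only clear the waiters bit if it is actually set; an
             * unconditional RMW could overwrite a concurrent fastpath
             * cmpxchg-release and resurrect a stale owner. Safe without
             * atomics because the bit only changes under wait_lock. */
            owner = READ_ONCE(*p);
            if (owner & RT_MUTEX_HAS_WAITERS)
                    WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
    }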
2016-12-01  audit: remove useless synchronize_net()  (WANG Cong)
The netlink kernel socket is protected by a refcount, not by RCU, and its rcv path is not protected by RCU either. So the synchronize_net() is just pointless.

Cc: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30  bpf: fix states equal logic for varlen access  (Josef Bacik)
If we have a branch that looks something like this:

	int foo = map->value;
	if (condition) {
		foo += blah;
	} else {
		foo = bar;
	}
	map->array[foo] = baz;

We will incorrectly assume that the !condition branch is equal to the condition branch, as the register for foo will be UNKNOWN_VALUE in both cases. We need to adjust this logic to only do this if we didn't do a varlen access after we processed the !condition branch; otherwise we have different ranges and need to check the other branch as well.

Fixes: 484611357c19 ("bpf: allow access into map value arrays")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29  bpf: cgroup: fix documentation of __cgroup_bpf_update()  (Daniel Mack)
There's a 'not' missing in one paragraph. Add it.

Fixes: 3007098494be ("cgroup: add support for eBPF programs")
Signed-off-by: Daniel Mack <daniel@zonque.org>
Reported-by: Rami Rosen <roszenrami@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29  Re-enable CONFIG_MODVERSIONS in a slightly weaker form  (Linus Torvalds)
This enables CONFIG_MODVERSIONS again, but allows for missing symbol CRC information in order to work around the issue that newer binutils versions seem to occasionally drop the CRC on the floor. binutils 2.26 seems to work fine, while binutils 2.27 seems to break MODVERSIONS of symbols that have been defined in assembler files.

[ We've had random missing CRCs before - it may be an old problem that just is now reliably triggered with the weak asm symbols and a new version of binutils ]

Some day I really do want to remove MODVERSIONS entirely. Sadly, today does not appear to be that day: Debian people apparently do want the option to enable MODVERSIONS to make it easier to have external modules across kernel versions, and this seems to be a fairly minimal fix for the annoying problem.

Cc: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Michal Marek <mmarek@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-27  bpf: allow for mount options to specify permissions  (Daniel Borkmann)
Since we recently converted the BPF filesystem over to use mount_nodev(), we now have the possibility to also hold mount options in the sb's s_fs_info. This work implements mount options support for specifying permissions on the sb's inode, which will be used by tc when it manually needs to mount the fs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
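What this enables from userspace, sketched with the standard mount(2) call; the "mode=" option is what the commit's purpose implies the fs parses, and the mountpoint and mode value are whatever the caller (e.g. tc) would choose:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            /* Permissions for the sb's root inode are passed as a
             * mount option in the data string. */
            if (mount("bpf", "/sys/fs/bpf", "bpf", 0, "mode=0700"))
                    perror("mount");
            return 0;
    }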
2016-11-27  bpf: add owner_prog_type and accounted mem to array map's fdinfo  (Daniel Borkmann)
Allow for checking the owner_prog_type of a program array map. In some cases bpf(2) can return -EINVAL /after/ the verifier passed and did all the rewrites of the bpf program. The reason that lets us fail at this late stage is that program array maps are incompatible. Allow users to inspect this earlier, after they got the map fd through the BPF_OBJ_GET command. tc will get support for this.

Also, display how much we charged the map with regards to RLIMIT_MEMLOCK.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27  bpf: drop unnecessary context cast from BPF_PROG_RUN  (Daniel Borkmann)
bpf_func has long since stopped taking only struct sk_buff * as input. Make it generic as void *, so that callers don't need to cast for it each time they call BPF_PROG_RUN().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-26  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
The udplite conflict is resolved by taking what 'net-next' did, which removed the backlog receive method assignment, since it is no longer necessary.

Two entries were added to the non-priv ethtool operations switch statement, one in 'net' and one in 'net-next', so simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands  (Daniel Mack)
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and BPF_PROG_DETACH, which allow attaching and detaching eBPF programs to a target. On the API level, the target could be anything that has an fd in userspace; hence the name of the field in union bpf_attr is 'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is expected to be a valid file descriptor of a cgroup v2 directory which has the bpf controller enabled. These are the only use-cases implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup, the program is swapped atomically, so userspace does not have to drop an existing program first before installing a new one, which would otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
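A userspace sketch of the new command; BPF_PROG_ATTACH and the target_fd/attach_bpf_fd/attach_type fields come from this patch (the enum spelling BPF_CGROUP_INET_INGRESS follows the merged uapi), while the cgroup path and the way prog_fd was obtained are assumptions:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <linux/bpf.h>
    #include <sys/syscall.h>

    static int attach_prog(int prog_fd)
    {
            union bpf_attr attr;
            /* A cgroup v2 directory with the bpf controller enabled. */
            int cg_fd = open("/sys/fs/cgroup/foo", O_DIRECTORY | O_RDONLY);

            if (cg_fd < 0)
                    return -1;

            memset(&attr, 0, sizeof(attr));
            attr.target_fd     = cg_fd;    /* anything with an fd, here a cgroup */
            attr.attach_bpf_fd = prog_fd;  /* previously loaded program */
            attr.attach_type   = BPF_CGROUP_INET_INGRESS;

            /* Requires CAP_NET_ADMIN; atomically replaces any program
             * of the same type already attached to the cgroup. */
            return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
    }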
2016-11-25  cgroup: add support for eBPF programs  (Daniel Mack)
This patch adds two sets of eBPF program pointers to struct cgroup. One for programs that are directly pinned to a cgroup, and one for those that are effective for it.

To illustrate the logic behind that, assume the following example cgroup hierarchy:

  A - B - C
       \ D - E

If only B has a program attached, it will be effective for B, C, D and E. If D then attaches a program itself, that will be effective for both D and E, and the program in B will only affect B and C. Only one program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2) syscall. For now, ingress and egress inet socket filtering are the only supported use-cases.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>