path: root/kernel
Age  Commit message  Author
2020-05-08  livepatch: Remove .klp.arch  Peter Zijlstra

After the previous patch, vmlinux-specific KLP relocations are now applied early during KLP module load. This means that .klp.arch sections are no longer needed for *vmlinux-specific* KLP relocations.

One might think they're still needed for *module-specific* KLP relocations. If a to-be-patched module is loaded *after* its corresponding KLP module is loaded, any corresponding KLP relocations will be delayed until the to-be-patched module is loaded. If any special sections (.parainstructions, for example) rely on those relocations, their initializations (apply_paravirt) need to be done afterwards. Thus the apparent need for arch_klp_init_object_loaded() and its corresponding .klp.arch sections -- it allows some of the special section initializations to be done at a later time.

But... if you look closer, that dependency between the special sections and the module-specific KLP relocations doesn't actually exist in reality. Looking at the contents of the .altinstructions and .parainstructions sections, there's not a realistic scenario in which a KLP module's .altinstructions or .parainstructions section needs to access a symbol in a to-be-patched module. It might need to access a local symbol or even a vmlinux symbol; but not another module's symbol. When a special section needs to reference a local or vmlinux symbol, a normal rela can be used instead of a KLP rela.

Since the special section initializations don't actually have any real dependency on module-specific KLP relocations, .klp.arch and arch_klp_init_object_loaded() no longer have a reason to exist. So remove them.

As Peter said much more succinctly:

  So the reason for .klp.arch was that .klp.rela.* stuff would overwrite paravirt instructions. If that happens you're doing it wrong. Those RELAs are core kernel, not module, and thus should've happened in .rela.* sections at patch-module loading time.

  Reverting this removes the two apply_{paravirt,alternatives}() calls from the late patching path, and means we don't have to worry about them when removing module_disable_ro().

[ jpoimboe: Rewrote patch description. Tweaked klp_init_object_loaded() error path. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2020-05-08  livepatch: Apply vmlinux-specific KLP relocations early  Josh Poimboeuf

KLP relocations are livepatch-specific relocations which are applied to a KLP module's text or data. They exist for two reasons:

1) Unexported symbols: replacement functions often need to access unexported symbols (e.g. static functions), which "normal" relocations don't allow.

2) Late module patching: this is the ability for a KLP module to bypass normal module dependencies, such that the KLP module can be loaded *before* a to-be-patched module. This means that relocations which need to access symbols in the to-be-patched module might need to be applied to the KLP module well after it has been loaded.

Non-late-patched KLP relocations are applied from the KLP module's init function. That usually works fine, unless the patched code wants to use alternatives, paravirt patching, jump tables, or some other special section which needs relocations. Then we run into ordering issues and crashes.

In order for those special sections to work properly, the KLP relocations should be applied *before* the special section init code runs, such as apply_paravirt(), apply_alternatives(), or jump_label_apply_nops().

You might think the obvious solution would be to move the KLP relocation initialization earlier, but it's not necessarily that simple. The problem is the above-mentioned late module patching, for which KLP relocations can get applied well after the KLP module is loaded.

To "fix" this issue in the past, we created .klp.arch sections:

  .klp.arch.{module}..altinstructions
  .klp.arch.{module}..parainstructions

Those sections allow KLP late module patching code to call apply_paravirt() and apply_alternatives() after the module-specific KLP relocations (.klp.rela.{module}.{section}) have been applied.

But that has a lot of drawbacks, including code complexity, the need for arch-specific code, and the (per-arch) danger that we missed some special section -- for example the __jump_table section which is used for jump labels.

It turns out there's a simpler and more functional approach. There are two kinds of KLP relocation sections:

1) vmlinux-specific KLP relocation sections

   .klp.rela.vmlinux.{sec}

   These are relocations (applied to the KLP module) which reference unexported vmlinux symbols.

2) module-specific KLP relocation sections

   .klp.rela.{module}.{sec}:

   These are relocations (applied to the KLP module) which reference unexported or exported module symbols.

Up until now, these have been treated the same. However, they're inherently different.

Because of late module patching, module-specific KLP relocations can be applied very late, thus they can create the ordering headaches described above.

But vmlinux-specific KLP relocations don't have that problem. There's nothing to prevent them from being applied earlier. So apply them at the same time as normal relocations, when the KLP module is being loaded.

This means that for vmlinux-specific KLP relocations, we no longer have any ordering issues. vmlinux-referencing jump labels, alternatives, and paravirt patching will work automatically, without the need for the .klp.arch hacks.

All that said, for module-specific KLP relocations, the ordering problems still exist and we *do* still need .klp.arch. Or do we? Stay tuned.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
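The distinction between the two kinds of KLP relocation sections comes down to the object name embedded in the section name. The following is a minimal userspace sketch of that classification, not the kernel's code: klp_rela_objname() and klp_rela_is_vmlinux() are hypothetical helper names, the buffer size is an assumption made for this example, and the object name is simply taken to end at the next '.'.

```c
#include <stddef.h>
#include <string.h>

#define KLP_RELA_PREFIX	".klp.rela."
#define OBJNAME_MAX	56	/* assumed buffer size for this sketch */

/*
 * Copy the object name ("vmlinux" or a module name) out of a section
 * name of the form .klp.rela.{objname}.{sec}.
 * Returns 0 on success, -1 if the name does not follow that pattern.
 */
int klp_rela_objname(const char *secname, char *objname, size_t len)
{
	size_t prefix_len = strlen(KLP_RELA_PREFIX);
	const char *start, *end;
	size_t n;

	if (strncmp(secname, KLP_RELA_PREFIX, prefix_len) != 0)
		return -1;

	start = secname + prefix_len;
	end = strchr(start, '.');	/* object name ends at the next '.' */
	if (!end || end == start)
		return -1;

	n = (size_t)(end - start);
	if (n >= len)
		return -1;

	memcpy(objname, start, n);
	objname[n] = '\0';
	return 0;
}

/*
 * Returns 1 for a vmlinux-specific section (could be applied at KLP
 * module load), 0 for a module-specific one (may have to wait for the
 * to-be-patched module), and -1 if this is not a KLP rela section.
 */
int klp_rela_is_vmlinux(const char *secname)
{
	char objname[OBJNAME_MAX];

	if (klp_rela_objname(secname, objname, sizeof(objname)) != 0)
		return -1;
	return strcmp(objname, "vmlinux") == 0;
}
```

With these helpers, klp_rela_is_vmlinux(".klp.rela.vmlinux.text") reports a vmlinux-specific section, while a name such as ".klp.rela.ext4.text" (a made-up example) is treated as module-specific.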
2020-05-08  livepatch: Disallow vmlinux.ko  Josh Poimboeuf
This is purely a theoretical issue, but if there were a module named vmlinux.ko, the livepatch relocation code wouldn't be able to distinguish between vmlinux-specific and vmlinux.o-specific KLP relocations. If CONFIG_LIVEPATCH is enabled, don't allow a module named vmlinux.ko. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2020-05-07  exec: Merge install_exec_creds into setup_new_exec  Eric W. Biederman
The two functions are now always called one right after the other so merge them together to make future maintenance easier. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-07  tracing: Make tracing_snapshot_instance_cond() static  Zou Wei
Fix the following sparse warning: kernel/trace/trace.c:950:6: warning: symbol 'tracing_snapshot_instance_cond' was not declared. Should it be static? Link: http://lkml.kernel.org/r/1587614905-48692-1-git-send-email-zou_wei@huawei.com Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zou Wei <zou_wei@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-07  tracing: Add a vmalloc_sync_mappings() for safe measure  Steven Rostedt (VMware)
x86_64 lazily maps in the vmalloc pages, and the way this works with per_cpu areas can be complex, to say the least. Mappings may happen at boot up, and if nothing synchronizes the page tables, those page mappings may not be synced till they are used. This causes issues for anything that might touch one of those mappings in the path of the page fault handler. When one of those unmapped mappings is touched in the page fault handler, it will cause another page fault, which in turn will cause a page fault, and leave us in a loop of page faults. Commit 763802b53a42 ("x86/mm: split vmalloc_sync_all()") split vmalloc_sync_all() into vmalloc_sync_unmappings() and vmalloc_sync_mappings(), as on system exit, it did not need to do a full sync on x86_64 (although it still needed to be done on x86_32). By chance, the vmalloc_sync_all() would synchronize the page mappings done at boot up and prevent the per cpu area from being a problem for tracing in the page fault handler. But when that synchronization in the exit of a task became a nop, it caused the problem to appear. Link: https://lore.kernel.org/r/20200429054857.66e8e333@oasis.local.home Cc: stable@vger.kernel.org Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code") Reported-by: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com> Suggested-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-07  tracing: Wait for preempt irq delay thread to finish  Steven Rostedt (VMware)
Running on a slower machine, it is possible that the preempt delay kernel thread may still be executing if the module was immediately removed after added, and this can cause the kernel to crash as the kernel thread might be executing after its code has been removed. There's no reason that the caller of the code shouldn't just wait for the delay thread to finish, as the thread can also be created by a trigger in the sysfs code, which also has the same issues. Link: http://lore.kernel.org/r/5EA2B0C8.2080706@cn.fujitsu.com Cc: stable@vger.kernel.org Fixes: 793937236d1ee ("lib: Add module for testing preemptoff/irqsoff latency tracers") Reported-by: Xiao Yang <yangx.jy@cn.fujitsu.com> Reviewed-by: Xiao Yang <yangx.jy@cn.fujitsu.com> Reviewed-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-07  Merge branches 'fixes.2020.04.27a', 'kfree_rcu.2020.04.27a', 'rcu-tasks.2020.04.27a', 'stall.2020.04.27a' and 'torture.2020.05.07a' into HEAD  Paul E. McKenney

fixes.2020.04.27a: Miscellaneous fixes.
kfree_rcu.2020.04.27a: Changes related to kfree_rcu().
rcu-tasks.2020.04.27a: Addition of new RCU-tasks flavors.
stall.2020.04.27a: RCU CPU stall-warning updates.
torture.2020.05.07a: Torture-test updates.
2020-05-07  rcutorture: Convert ULONG_CMP_LT() to time_before()  Paul E. McKenney
This commit converts three ULONG_CMP_LT() invocations in rcutorture to time_before() to reflect the fact that they are comparing timestamps to the jiffies counter. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
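To illustrate why the conversion is about intent and not behavior, here is a userspace sketch that restates both macros. The definitions are simplified (the kernel versions add type checking), and deadline_pending() is a hypothetical helper invented for this example. It assumes the usual two's-complement conversion from unsigned long to long.

```c
#include <limits.h>

/* Simplified userspace restatements of the two macros, for illustration. */
#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))
#define time_before(a, b)	((long)((a) - (b)) < 0)

/*
 * Hypothetical helper: returns 1 while "now" is still before "deadline",
 * even if the counter has wrapped around in between.
 */
int deadline_pending(unsigned long now, unsigned long deadline)
{
	return time_before(now, deadline);
}
```

Both macros rely on wrapping unsigned subtraction, so a timestamp taken just before the counter wraps still compares as earlier than one taken just after. time_before() simply says out loud that the operands are jiffies-style timestamps.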
2020-05-07  rcutorture: Make rcu_fwds and rcu_fwd_emergency_stop static  Jason Yan
This commit fixes the following sparse warning: kernel/rcu/rcutorture.c:1695:16: warning: symbol 'rcu_fwds' was not declared. Should it be static? kernel/rcu/rcutorture.c:1696:6: warning: symbol 'rcu_fwd_emergency_stop' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-05-07  rcu: Allow rcutorture to starve grace-period kthread  Paul E. McKenney
This commit provides an rcutorture.stall_gp_kthread module parameter to allow rcutorture to starve the grace-period kthread. This allows testing the code that detects such starvation. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-05-07  rcutorture: Add flag to produce non-busy-wait task stalls  Paul E. McKenney
This commit aids testing of RCU task stall warning messages by adding an rcutorture.stall_cpu_block module parameter that results in the induced stall sleeping within the RCU read-side critical section. Spinning with interrupts disabled is still available via the rcutorture.stall_cpu_irqsoff module parameter, and specifying neither of these two module parameters will spin with preemption disabled. Note that sleeping (as opposed to preemption) results in additional complaints from RCU at context-switch time, so yet more testing. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-05-07  kgdb: Drop malformed kernel doc comment  Andy Shevchenko
Kernel doc does not understand POD variables to be referred to. .../debug_core.c:73: warning: cannot understand function prototype: 'int kgdb_connected; ' Convert kernel doc to pure comment. Fixes: dc7d55270521 ("kgdb: core") Cc: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-05-07  cpu/hotplug: Remove __freeze_secondary_cpus()  Qais Yousef
The refactored function is no longer required as the codepaths that call freeze_secondary_cpus() are all suspend/resume related now. Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Link: https://lkml.kernel.org/r/20200430114004.17477-2-qais.yousef@arm.com
2020-05-07  cpu/hotplug: Remove disable_nonboot_cpus()  Qais Yousef

The single user could have called freeze_secondary_cpus() directly. Since this function was a source of confusion, remove it as it's just a pointless wrapper.

While at it, rename enable_nonboot_cpus() to thaw_secondary_cpus() to preserve the naming symmetry. Done automatically via:

  git grep -l enable_nonboot_cpus | xargs sed -i 's/enable_nonboot_cpus/thaw_secondary_cpus/g'

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://lkml.kernel.org/r/20200430114004.17477-1-qais.yousef@arm.com
2020-05-06  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  David S. Miller
Conflicts were all overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06  tracing/kprobes: Reject new event if loc is NULL  Masami Hiramatsu
Reject the new event which has NULL location for kprobes. For kprobes, user must specify at least the location. Link: http://lkml.kernel.org/r/158779376597.6082.1411212055469099461.stgit@devnote2 Cc: Tom Zanussi <zanussi@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Fixes: 2a588dd1d5d6 ("tracing: Add kprobe event command generation functions") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-06  tracing/boottime: Fix kprobe event API usage  Masami Hiramatsu

Fix boottime kprobe events to use the API correctly for multiple events. For example, when we set multiprobe kprobe events in bootconfig like below,

  ftrace.event.kprobes.myevent {
    probes = "vfs_read $arg1 $arg2", "vfs_write $arg1 $arg2"
  }

this causes an error:

  trace_boot: Failed to add probe: p:kprobes/myevent (null) vfs_read $arg1 $arg2 vfs_write $arg1 $arg2

This shows that the 1st argument becomes NULL and the multiprobes are merged into 1 probe.

Link: http://lkml.kernel.org/r/158779375766.6082.201939936008972838.stgit@devnote2
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 29a154810546 ("tracing: Change trace_boot to use kprobe_event interface")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-06  tracing/kprobes: Fix a double initialization typo  Masami Hiramatsu
Fix a typo that resulted in an unnecessary double initialization to addr. Link: http://lkml.kernel.org/r/158779374968.6082.2337484008464939919.stgit@devnote2 Cc: Tom Zanussi <zanussi@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Fixes: c7411a1a126f ("tracing/kprobe: Check whether the non-suffixed symbol is notrace") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-05  signal: refactor copy_siginfo_to_user32  Christoph Hellwig
Factor out a copy_siginfo_to_external32 helper from copy_siginfo_to_user32 that fills out the compat_siginfo, but does so on a kernel space data structure. With that we can let architectures override copy_siginfo_to_user32 with their own implementations using copy_siginfo_to_external32. That allows moving the x32 SIGCHLD purely to x86 architecture code. As a nice side effect copy_siginfo_to_external32 also comes in handy for avoiding a set_fs() call in the coredump code later on. Contains improvements from Eric W. Biederman <ebiederm@xmission.com> and Arnd Bergmann <arnd@arndb.de>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-05  sysctl: Fix unused function warning  Arnd Bergmann
The newly added bpf_stats_handler function has the wrong #ifdef check around it, leading to an unused-function warning when CONFIG_SYSCTL is disabled: kernel/sysctl.c:205:12: error: unused function 'bpf_stats_handler' [-Werror,-Wunused-function] static int bpf_stats_handler(struct ctl_table *table, int write, Fix the check to match the reference. Fixes: d46edd671a14 ("bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200505140734.503701-1-arnd@arndb.de
2020-05-05  workqueue: Use IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO.  Sean Fu

Replace inline function PTR_ERR_OR_ZERO with IS_ERR and PTR_ERR to remove redundant parameter definitions and checks. Reduce code size.

Before:
   text    data     bss     dec     hex filename
  47510    5979     840   54329    d439 kernel/workqueue.o

After:
   text    data     bss     dec     hex filename
  47474    5979     840   54293    d415 kernel/workqueue.o

Signed-off-by: Sean Fu <fxinrong@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
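The two styles being traded off can be sketched in userspace as follows. This is an illustration under assumptions, not the workqueue code: the error-pointer helpers are simplified restatements of the kernel ones, and struct attrs, get_attrs(), apply_old() and apply_new() are hypothetical names made up for the example.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO	4095

/* Simplified userspace restatements of the kernel error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errors. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline int PTR_ERR_OR_ZERO(const void *ptr)
{
	return IS_ERR(ptr) ? (int)PTR_ERR(ptr) : 0;
}

struct attrs { int nice; };	/* hypothetical payload type */

/* Hypothetical allocator: returns an encoded -ENOMEM when asked to fail. */
struct attrs *get_attrs(int fail)
{
	static struct attrs a;

	return fail ? ERR_PTR(-ENOMEM) : &a;
}

/* Older style: derive an int from the pointer first, then test the int. */
int apply_old(int fail)
{
	struct attrs *attrs = get_attrs(fail);
	int ret = PTR_ERR_OR_ZERO(attrs);

	if (ret)
		return ret;
	return 0;
}

/* Newer style: test the pointer directly, decode only on the error path. */
int apply_new(int fail)
{
	struct attrs *attrs = get_attrs(fail);

	if (IS_ERR(attrs))
		return (int)PTR_ERR(attrs);
	return 0;
}
```

Both functions return the same values. The second form avoids the extra check on the intermediate integer, which is the kind of redundancy the commit message refers to.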
2020-05-01  bpf: Fix use-after-free of bpf_link when priming half-fails  Andrii Nakryiko

If bpf_link_prime() succeeds in allocating a new anon file, but then fails to allocate an ID for it, link priming is considered to have failed and the user is supposed to be able to directly kfree() the bpf_link, because it was never exposed to user-space.

But at that point the file already keeps a pointer to the bpf_link and will eventually call bpf_link_release(), so if the bpf_link was kfree()'d by the caller, that would lead to a use-after-free.

Fix this by first allocating the ID and only then allocating the file. Adding the ID to link_idr is ok, because the link at that point still doesn't have its ID set, so no user-space process can create a new FD for it.

Fixes: a3b80e107894 ("bpf: Allocate ID for bpf_link")
Reported-by: syzbot+39b64425f91b5aab714d@syzkaller.appspotmail.com
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200501185622.3088964-1-andriin@fb.com
2020-05-01  bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS  Song Liu

Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats. Typical userspace tools use kernel.bpf_stats_enabled as follows:

  1. Enable kernel.bpf_stats_enabled;
  2. Check program run_time_ns;
  3. Sleep for the monitoring period;
  4. Check program run_time_ns again, calculate the difference;
  5. Disable kernel.bpf_stats_enabled.

The problem with this approach is that only one userspace tool can toggle this sysctl. If multiple tools toggle the sysctl at the same time, the measurement may be inaccurate.

To fix this problem while keeping backward compatibility, introduce a new bpf command BPF_ENABLE_STATS. On success, this command enables stats and returns a valid fd. BPF_ENABLE_STATS takes argument "type". Currently, only one type, BPF_STATS_RUN_TIME, is supported. We can extend the command to support other types of stats in the future.

With BPF_ENABLE_STATS, a user space tool would have the following flow:

  1. Get a fd with BPF_ENABLE_STATS, and make sure it is valid;
  2. Check program run_time_ns;
  3. Sleep for the monitoring period;
  4. Check program run_time_ns again, calculate the difference;
  5. Close the fd.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200430071506.1408910-2-songliubraving@fb.com
2020-05-01  audit: make symbol 'audit_nfcfgs' static  Zheng Bin
Fix sparse warnings: kernel/auditsc.c:138:32: warning: symbol 'audit_nfcfgs' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-05-01  uaccess: Selectively open read or write user access  Christophe Leroy
When opening user access to only perform reads, only open read access. When opening user access to only perform writes, only open write access. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/2e73bc57125c2c6ab12a587586a4eed3a47105fc.1585898438.git.christophe.leroy@c-s.fr
2020-04-30  sched/core: Simplify sched_init()  Wei Yang

Currently root_task_group.shares and cfs_bandwidth are initialized for each online CPU, which is not necessary. Let's take it out and do it only once.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200423214443.29994-1-richard.weiyang@gmail.com
2020-04-30  sched/fair: Use __this_cpu_read() in wake_wide()  Muchun Song

The code is executed with preemption (and interrupts) disabled, so it's safe to use __this_cpu_read().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421144123.33580-1-songmuchun@bytedance.com
2020-04-30  sched/core: Fix illegal RCU from offline CPUs  Peter Zijlstra

In the CPU-offline process, it calls mmdrop() after idle entry and the subsequent call to cpuhp_report_idle_dead(). Once execution passes the call to rcu_report_dead(), RCU is ignoring the CPU, which results in lockdep complaining when mmdrop() uses RCU from either memcg or debugobjects below.

Fix it by cleaning up the active_mm state from BP instead. Every arch which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit() from AP. The only exception is parisc because it switches them to &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()), but the patch will still work there because it calls mmgrab(&init_mm) in smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().

  WARNING: suspicious RCU usage
  -----------------------------
  kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!

  other info that might help us debug this:

  RCU used illegally from offline CPU!
  Call Trace:
   dump_stack+0xf4/0x164 (unreliable)
   lockdep_rcu_suspicious+0x140/0x164
   get_work_pool+0x110/0x150
   __queue_work+0x1bc/0xca0
   queue_work_on+0x114/0x120
   css_release+0x9c/0xc0
   percpu_ref_put_many+0x204/0x230
   free_pcp_prepare+0x264/0x570
   free_unref_page+0x38/0xf0
   __mmdrop+0x21c/0x2c0
   idle_task_exit+0x170/0x1b0
   pnv_smp_cpu_kill_self+0x38/0x2e0
   cpu_die+0x48/0x64
   arch_cpu_idle_dead+0x30/0x50
   do_idle+0x2f4/0x470
   cpu_startup_entry+0x38/0x40
   start_secondary+0x7a8/0xa80
   start_secondary_resume+0x10/0x14

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Link: https://lkml.kernel.org/r/20200401214033.8448-1-cai@lca.pw
2020-04-30  sched/fair: Mark sched_init_granularity __init  Muchun Song
Function sched_init_granularity() is only called from __init functions, so mark it __init as well. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20200406074750.56533-1-songmuchun@bytedance.com
2020-04-30  sched/fair: Refill bandwidth before scaling  Huaixin Chang
In order to prevent possible hardlockup of sched_cfs_period_timer() loop, loop count is introduced to denote whether to scale quota and period or not. However, scale is done between forwarding period timer and refilling cfs bandwidth runtime, which means that period timer is forwarded with old "period" while runtime is refilled with scaled "quota". Move do_sched_cfs_period_timer() before scaling to solve this. Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup") Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/20200420024421.22442-3-changhuaixin@linux.alibaba.com
2020-04-30  sched: Extract the task putting code from pick_next_task()  Chen Yu
Introduce a new function put_prev_task_balance() to do the balance when necessary, and then put previous task back to the run queue. This function is extracted from pick_next_task() to prepare for future usage by other type of task picking logic. No functional change. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/5a99860cf66293db58a397d6248bcb2eee326776.1587464698.git.yu.c.chen@intel.com
2020-04-30  sched: Make newidle_balance() static again  Chen Yu
After Commit 6e2df0581f56 ("sched: Fix pick_next_task() vs 'change' pattern race"), there is no need to expose newidle_balance() as it is only used within fair.c file. Change this function back to static again. No functional change. Reported-by: kbuild test robot <lkp@intel.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/83cd3030b031ca5d646cd5e225be10e7a0fdd8f5.1587464698.git.yu.c.chen@intel.com
2020-04-30  sched/topology: Kill SD_LOAD_BALANCE  Valentin Schneider
That flag is set unconditionally in sd_init(), and no one checks for it anymore. Remove it. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200415210512.805-5-valentin.schneider@arm.com
2020-04-30  sched: Remove checks against SD_LOAD_BALANCE  Valentin Schneider

The SD_LOAD_BALANCE flag is set unconditionally for all domains in sd_init(). By making the sched_domain->flags sysctl interface read-only, we have removed the last piece of code that could clear that flag - as such, it will now be always present. Rather than keep carrying it along, we can work towards getting rid of it entirely.

cpusets don't need it because they can make CPUs be attached to the NULL domain (e.g. cpuset with sched_load_balance=0), or to a partitioned root_domain, i.e. a sched_domain hierarchy that doesn't span the entire system (e.g. root cpuset with sched_load_balance=0 and sibling cpusets with sched_load_balance=1).

isolcpus apply the same "trick": isolated CPUs are explicitly taken out of the sched_domain rebuild (using housekeeping_cpumask()), so they get the NULL domain treatment as well.

Remove the checks against SD_LOAD_BALANCE.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200415210512.805-4-valentin.schneider@arm.com
2020-04-30  sched/debug: Make sd->flags sysctl read-only  Valentin Schneider
Writing to the sysctl of a sched_domain->flags directly updates the value of the field, and goes nowhere near update_top_cache_domain(). This means that the cached domain pointers can end up containing stale data (e.g. the domain pointed to doesn't have the relevant flag set anymore). Explicit domain walks that check for flags will be affected by the write, but this won't be in sync with the cached pointers which will still point to the domains that were cached at the last sched_domain build. In other words, writing to this interface is playing a dangerous game. It could be made to trigger an update of the cached sched_domain pointers when written to, but this does not seem to be worth the trouble. Make it read-only. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200415210512.805-3-valentin.schneider@arm.com
2020-04-30  sched/fair: find_idlest_group(): Remove unused sd_flag parameter  Valentin Schneider
The last use of that parameter was removed by commit 57abff067a08 ("sched/fair: Rework find_idlest_group()") Get rid of the parameter. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20200415210512.805-2-valentin.schneider@arm.com
2020-04-30  exit: Move preemption fixup up, move blocking operations down  Jann Horn
With CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_CGROUPS=y, kernel oopses in non-preemptible context look untidy; after the main oops, the kernel prints a "sleeping function called from invalid context" report because exit_signals() -> cgroup_threadgroup_change_begin() -> percpu_down_read() can sleep, and that happens before the preempt_count_set(PREEMPT_ENABLED) fixup. It looks like the same thing applies to profile_task_exit() and kcov_task_exit(). Fix it by moving the preemption fixup up and the calls to profile_task_exit() and kcov_task_exit() down. Fixes: 1dc0fffc48af ("sched/core: Robustify preemption leak checks") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200305220657.46800-1-jannh@google.com
2020-04-30  sched/fair: Simplify the code of should_we_balance()  Peng Wang
We only consider group_balance_cpu() after there is no idle cpu. So, just do comparison before return at these two cases. Signed-off-by: Peng Wang <rocking@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/245c792f0e580b3ca342ad61257f4c066ee0f84f.1586594833.git.rocking@linux.alibaba.com
2020-04-30  sched/fair: Remove distribute_running from CFS bandwidth  Josh Don
This is mostly a revert of commit: baa9be4ffb55 ("sched/fair: Fix throttle_list starvation with low CFS quota") The primary use of distribute_running was to determine whether to add throttled entities to the head or the tail of the throttled list. Now that we always add to the tail, we can remove this field. The other use of distribute_running is in the slack_timer, so that we don't start a distribution while one is already running. However, even in the event that this race occurs, it is fine to have two distributions running (especially now that distribute grabs the cfs_b->lock to determine remaining quota before assigning). Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Tested-by: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/20200410225208.109717-3-joshdon@google.com
2020-04-30sched/fair: Eliminate bandwidth race between throttling and distributionPaul Turner
There is a race window in which an entity begins throttling before quota is added to the pool, but does not finish throttling until after we have finished with distribute_cfs_runtime(). This entity is not observed by distribute_cfs_runtime() because it was not on the throttled list at the time that distribution was running. This race manifests as rare period-length stalls for such entities. Rather than heavy-weight the synchronization with the progress of distribution, we can fix this by aborting throttling if bandwidth has become available. Otherwise, we immediately add the entity to the throttled list so that it can be observed by a subsequent distribution. Additionally, we can remove the case of adding the throttled entity to the head of the throttled list, and simply always add to the tail. Thanks to 26a8b12747c97, distribute_cfs_runtime() no longer holds onto its own pool of runtime. This means that if we do hit the !assign and distribute_running case, we know that distribution is about to end. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/20200410225208.109717-2-joshdon@google.com
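A minimal userspace sketch of the "abort throttling if bandwidth has become available, otherwise append to the tail" rule is shown here. The struct and function names are hypothetical, the one-unit slice is an arbitrary choice for illustration, and locking is omitted; in the scheduler this decision would be made while holding the bandwidth pool's lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_entity {
	long runtime;                 /* runtime currently assigned */
	struct sketch_entity *next;   /* link in the throttled list */
};

struct sketch_pool {
	long runtime;                 /* quota left in the shared pool */
	struct sketch_entity *head;   /* throttled list, oldest first */
	struct sketch_entity *tail;
};

/*
 * Returns true if the entity was throttled, false if throttling was
 * aborted. Real code would run this whole body under the pool's lock.
 */
bool sketch_throttle(struct sketch_pool *pool, struct sketch_entity *e)
{
	assert(e->next == NULL);

	if (pool->runtime > 0) {
		/* Raced with a refill: take a small slice and keep running. */
		pool->runtime -= 1;
		e->runtime += 1;
		return false;
	}

	/* Always append, so a later distribution will observe this entity. */
	if (pool->tail)
		pool->tail->next = e;
	else
		pool->head = e;
	pool->tail = e;
	return true;
}
```

Because the check and the list insertion happen in one step, an entity is either given runtime or made visible to the next distribution; it cannot end up in neither state.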
2020-04-30sched/debug: Fix trival print_task() formatXie XiuQi
Ensure there is one space between the state and the task name. w/o patch: runnable tasks: S task PID tree-key switches prio wait Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200414125721.195801-1-xiexiuqi@huawei.com
2020-04-30perf: Add cond_resched() to task_function_call()Barret Rhoden
Under rare circumstances, task_function_call() can repeatedly fail and cause a soft lockup. There is a slight race where the process is no longer running on the cpu we targeted by the time remote_function() runs. The code will simply try again. If we are very unlucky, this will continue to fail, until a watchdog fires. This can happen in a heavily loaded, multi-core virtual machine. Reported-by: syzbot+bb4935a5c09b5ff79940@syzkaller.appspotmail.com Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200414222920.121401-1-brho@google.com
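The shape of such a retry loop can be sketched in userspace. Everything below is an illustration under assumptions: the function and type names are hypothetical, and sched_yield() stands in for the kernel's cond_resched(), which has no direct userspace equivalent.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Hypothetical callback type: returns 0 on success or a negative errno. */
typedef int (*sketch_attempt_fn)(void *arg);

/*
 * Retry while the attempt reports -EAGAIN, yielding the CPU between tries
 * so that a long run of failures does not monopolize it.
 */
int sketch_call_until_settled(sketch_attempt_fn attempt, void *arg)
{
	int ret;

	assert(attempt != NULL);

	for (;;) {
		ret = attempt(arg);
		if (ret != -EAGAIN)
			break;
		sched_yield();  /* userspace stand-in for cond_resched() */
	}
	return ret;
}

/* Example attempt: fails with -EAGAIN until the counter reaches zero. */
int sketch_flaky_attempt(void *arg)
{
	int *remaining_failures = arg;

	if (*remaining_failures > 0) {
		(*remaining_failures)--;
		return -EAGAIN;
	}
	return 0;
}
```

The loop still retries indefinitely; the yield only gives other work a chance to run between attempts.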
2020-04-30bpf: Fix error return code in map_lookup_and_delete_elem()Wei Yongjun
Fix to return negative error code -EFAULT from the copy_to_user() error handling case instead of 0, as done elsewhere in this function. Fixes: bd513cd08f10 ("bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200430081851.166996-1-weiyongjun1@huawei.com
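The bug class is easy to reproduce in a small userspace model. In this sketch, sketch_copy_out() is a hypothetical stand-in for copy_to_user() (it returns the number of bytes that could not be copied), and the function names are invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for copy_to_user(): returns bytes NOT copied. */
size_t sketch_copy_out(void *dst, const void *src, size_t len)
{
	if (dst == NULL)
		return len;     /* simulate a faulting destination pointer */
	memcpy(dst, src, len);
	return 0;
}

int sketch_lookup_and_delete(void *user_dst, const int *value)
{
	int err = 0;

	assert(value != NULL);

	if (sketch_copy_out(user_dst, value, sizeof(*value)) != 0) {
		err = -EFAULT;  /* without this, the caller would see 0 */
		goto out;
	}

	/* ... on success the element would be deleted here ... */
out:
	return err;
}
```

When `err` is initialized to 0 and an error path jumps to the common exit without assigning it, the failure is silently reported as success.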
2020-04-30posix-cpu-timers: Use pids not tasks in lookupEric W. Biederman
The current posix-cpu-timer code uses pids when holding persistent references in timers. However the lookups from clockid_t still return tasks that need to be converted into pids for use. This results in usage being pid->task->pid and that can race with release_task and de_thread. This can lead to some not wrong but surprising results. Surprising enough that Oleg and I both thought there were some bugs in the code for a while. This set of changes modifies the code to just look up, verify, and return pids from the clockid_t lookups to remove those potentially troublesome races. Eric W. Biederman (3): posix-cpu-timers: Extend rcu_read_lock removing task_struct references posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock kernel/time/posix-cpu-timers.c | 102 ++++++++++++++++++----------------------- 1 file changed, 45 insertions(+), 57 deletions(-) Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-30remove the no longer needed pid_alive() check in __task_pid_nr_ns()Oleg Nesterov
Starting from 2c4704756cab ("pids: Move the pgrp and session pid pointers from task_struct to signal_struct") __task_pid_nr_ns() doesn't dereference task->group_leader, so we can remove the pid_alive() check. pid_nr_ns() has to check pid != NULL anyway; pid_alive() just adds unnecessary confusion. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-04-30padata: add separate cpuhp node for CPUHP_PADATA_DEADDaniel Jordan
Removing the pcrypt module triggers this: general protection fault, probably for non-canonical address 0xdead000000000122 CPU: 5 PID: 264 Comm: modprobe Not tainted 5.6.0+ #2 Hardware name: QEMU Standard PC RIP: 0010:__cpuhp_state_remove_instance+0xcc/0x120 Call Trace: padata_sysfs_release+0x74/0xce kobject_put+0x81/0xd0 padata_free+0x12/0x20 pcrypt_exit+0x43/0x8ee [pcrypt] padata instances wrongly use the same hlist node for the online and dead states, so __padata_free()'s second cpuhp remove call chokes on the node that the first poisoned. cpuhp multi-instance callbacks only walk forward in cpuhp_step->list and the same node is linked in both the online and dead lists, so the list corruption that results from padata_alloc() adding the node to a second list without removing it from the first doesn't cause problems as long as no instances are freed. Avoid the issue by giving each state its own node. Fixes: 894c9ef9780c ("padata: validate cpumask without removed CPU during offline") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: linux-crypto@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
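Why one node per list is required can be seen from a stripped-down model of an hlist. This is a userspace sketch, not the kernel's implementation: the type and function names are hypothetical, and a NULL `pprev` stands in for the kernel's list poison values.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_hnode { struct sketch_hnode *next, **pprev; };
struct sketch_hhead { struct sketch_hnode *first; };

static void sketch_hadd(struct sketch_hnode *n, struct sketch_hhead *h)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

static void sketch_hdel(struct sketch_hnode *n)
{
	/* A node can be unlinked once; a shared node would fail here. */
	assert(n->pprev != NULL);
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->next = NULL;
	n->pprev = NULL;        /* stand-in for the kernel's list poison */
}

struct sketch_instance {
	struct sketch_hnode online_node;  /* linked only in the online list */
	struct sketch_hnode dead_node;    /* linked only in the dead list */
};

void sketch_register(struct sketch_instance *i,
		     struct sketch_hhead *online, struct sketch_hhead *dead)
{
	sketch_hadd(&i->online_node, online);
	sketch_hadd(&i->dead_node, dead);
}

void sketch_unregister(struct sketch_instance *i)
{
	sketch_hdel(&i->dead_node);
	sketch_hdel(&i->online_node);
}
```

A node stores a single `next`/`pprev` pair, so linking it into a second list overwrites its position in the first, and the second removal then operates on pointers that the first removal already invalidated.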
2020-04-30bpf: Fix unused variable warningArnd Bergmann
Hiding the only use of bpf_link_type_strs[] in an #ifdef causes an unused-variable warning: kernel/bpf/syscall.c:2280:20: error: 'bpf_link_type_strs' defined but not used [-Werror=unused-variable] 2280 | static const char *bpf_link_type_strs[] = { Move the definition into the same #ifdef. Fixes: f2e10bff16a0 ("bpf: Add support for BPF_OBJ_GET_INFO_BY_FD for bpf_link") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200429132217.1294289-1-arnd@arndb.de
2020-04-29bpf: Allow bpf_map_lookup_elem for SOCKMAP and SOCKHASHJakub Sitnicki
White-list map lookup for SOCKMAP/SOCKHASH from BPF. Lookup returns a pointer to a full socket and acquires a reference if necessary. To support it we need to extend the verifier to know that: (1) register storing the lookup result holds a pointer to socket, if lookup was done on SOCKMAP/SOCKHASH, and that (2) map lookup on SOCKMAP/SOCKHASH is a reference acquiring operation, which needs a corresponding reference release with bpf_sk_release. On sock_map side, lookup handlers exposed via bpf_map_ops now bump sk_refcnt if socket is reference counted. In turn, bpf_sk_select_reuseport, the only in-kernel user of SOCKMAP/SOCKHASH ops->map_lookup_elem, was updated to release the reference. Sockets fetched from a map can be used in the same way as ones returned by BPF socket lookup helpers, such as bpf_sk_lookup_tcp. In particular, they can be used with bpf_sk_assign to direct packets toward a socket on TC ingress path. Suggested-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200429181154.479310-2-jakub@cloudflare.com
2020-04-29posix-cpu-timers: Replace __get_task_for_clock with pid_for_clockEric W. Biederman
Now that the code stores references to pids instead of references to tasks, looking up a task for a clock instead of looking up a struct pid makes the code more difficult to verify than necessary. In posix_cpu_timers_create get_task_pid can race with release_task for threads and return a NULL pid. As put_pid and cpu_timer_task_rcu handle NULL pids just fine the code works without problems but it is an extra case to consider and keep in mind while verifying and modifying the code. There are races with de_thread to consider that don't apply only because thread clocks are only allowed for threads in the same thread_group. So instead of leaving a burden for people making modifications to the code in the future, return an rcu protected struct pid for the clock instead. The logic for __get_task_for_clock and lookup_task has been folded into the new function pid_for_clock with the only change being the logic has been modified from working on a task to working on a pid that will be returned. In posix_cpu_clock_get, instead of calling pid_for_clock, checking the result, and then calling pid_task to get the task, the result of pid_for_clock is fed directly into pid_task. This is safe because pid_task handles NULL pids. As such an extra error check was unnecessary. Instead of hiding the flag that enables the special clock_gettime handling, I have made the 3 callers just pass the flag in themselves. That is less code and seems just as simple to work with as the wrapper functions. Historically the clock_gettime special case of allowing a process clock to be found by the thread id did not even exist [33ab0fec3352] but Thomas Gleixner reports that he has found code that uses that functionality [55e8c8eb2c7b]. Link: https://lkml.kernel.org/r/87zhaxqkwa.fsf@nanos.tec.linutronix.de/ Ref: 33ab0fec3352 ("posix-timers: Consolidate posix_cpu_clock_get()") Ref: 55e8c8eb2c7b ("posix-cpu-timers: Store a reference to a pid not a task") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
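The point about feeding the result of pid_for_clock straight into pid_task can be illustrated with a small userspace model. All types and functions below are hypothetical stand-ins with a `sketch_` prefix; the only property borrowed from the commit message is that the task lookup tolerates a NULL pid.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sketch_task { long cpu_time_ns; };
struct sketch_pid  { int nr; struct sketch_task *task; };

/* As described for pid_task(): a NULL pid simply yields a NULL task. */
struct sketch_task *sketch_pid_task(const struct sketch_pid *pid)
{
	return pid ? pid->task : NULL;
}

/* Hypothetical lookup: returns NULL when no entry matches the id. */
const struct sketch_pid *sketch_pid_for_clock(const struct sketch_pid *table,
					      size_t n, int nr)
{
	for (size_t i = 0; i < n; i++)
		if (table[i].nr == nr)
			return &table[i];
	return NULL;
}

int sketch_clock_get(const struct sketch_pid *table, size_t n, int nr,
		     long *out_ns)
{
	struct sketch_task *tsk;

	assert(out_ns != NULL);

	/* One check covers both "no such pid" and "pid has no task". */
	tsk = sketch_pid_task(sketch_pid_for_clock(table, n, nr));
	if (!tsk)
		return -EINVAL;

	*out_ns = tsk->cpu_time_ns;
	return 0;
}
```

Because the inner lookup's failure value is a valid input to the outer call, the intermediate NULL check adds nothing and can be dropped.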