summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2017-08-02cgroup: re-use the parent pointer in cgroup_destroy_locked()Roman Gushchin
As we already have a pointer to the parent cgroup in cgroup_destroy_locked(), we don't need to calculate it again to pass as an argument for cgroup1_check_for_release(). Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: kernel-team@fb.com Cc: linux-kernel@vger.kernel.org
2017-08-02cgroup: add cgroup.stat interface with basic hierarchy statsRoman Gushchin
A cgroup can consume resources even after being deleted by a user. For example, writing back dirty pages should be accounted and limited, despite the corresponding cgroup might contain no processes and being deleted by a user. In the current implementation a cgroup can remain in such "dying" state for an undefined amount of time. For instance, if a memory cgroup contains a pge, mlocked by a process belonging to an other cgroup. Although the lifecycle of a dying cgroup is out of user's control, it's important to have some insight of what's going on under the hood. In particular, it's handy to have a counter which will allow to detect css leaks. To solve this problem, add a cgroup.stat interface to the base cgroup control files with the following metrics: nr_descendants total number of visible descendant cgroups nr_dying_descendants total number of dying descendant cgroups Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org
2017-08-02cgroup: implement hierarchy limitsRoman Gushchin
Creating cgroup hierearchies of unreasonable size can affect overall system performance. A user might want to limit the size of cgroup hierarchy. This is especially important if a user is delegating some cgroup sub-tree. To address this issue, introduce an ability to control the size of cgroup hierarchy. The cgroup.max.descendants control file allows to set the maximum allowed number of descendant cgroups. The cgroup.max.depth file controls the maximum depth of the cgroup tree. Both are single value r/w files, with "max" default value. The control files exist on each hierarchy level (including root). When a new cgroup is created, we check the total descendants and depth limits on each level, and if none of them are exceeded, a new cgroup is created. Only alive cgroups are counted, removed (dying) cgroups are ignored. Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org
2017-08-02cgroup: keep track of number of descent cgroupsRoman Gushchin
Keep track of the number of online and dying descent cgroups. This data will be used later to add an ability to control cgroup hierarchy (limit the depth and the number of descent cgroups) and display hierarchy stats. Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org
2017-07-25cgroup: add comment to cgroup_enable_threaded()Tejun Heo
Explain cgroup_enable_threaded() and note that the function can never be called on the root cgroup. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Waiman Long <longman@redhat.com>
2017-07-25cgroup: remove unnecessary empty check when enabling threaded modeTejun Heo
cgroup_enable_threaded() checks that the cgroup doesn't have any tasks or children and fails the operation if so. This test is unnecessary because the first part is already checked by cgroup_can_be_thread_root() and the latter is unnecessary. The latter actually cause a behavioral oddity. Please consider the following hierarchy. All cgroups are domains. A / \ B C \ D If B is made threaded, C and D becomes invalid domains. Due to the no children restriction, threaded mode can't be enabled on C. For C and D, the only thing the user can do is removal. There is no reason for this restriction. Remove it. Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-21cgroup: update debug controller to print out thread mode informationWaiman Long
Update debug controller so that it prints out debug info about thread mode. 1) The relationship between proc_cset and threaded_csets are displayed. 2) The status of being a thread root or threaded cgroup is displayed. This patch is extracted from Waiman's larger patch. v2: - Removed [thread root] / [threaded] from debug.cgroup_css_links file as the same information is available from cgroup.type. Suggested by Waiman. - Threaded marking is moved to the previous patch. Patch-originally-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-21cgroup: implement cgroup v2 thread supportTejun Heo
This patch implements cgroup v2 thread support. The goal of the thread mode is supporting hierarchical accounting and control at thread granularity while staying inside the resource domain model which allows coordination across different resource controllers and handling of anonymous resource consumptions. A cgroup is always created as a domain and can be made threaded by writing to the "cgroup.type" file. When a cgroup becomes threaded, it becomes a member of a threaded subtree which is anchored at the closest ancestor which isn't threaded. The threads of the processes which are in a threaded subtree can be placed anywhere without being restricted by process granularity or no-internal-process constraint. Note that the threads aren't allowed to escape to a different threaded subtree. To be used inside a threaded subtree, a controller should explicitly support threaded mode and be able to handle internal competition in the way which is appropriate for the resource. The root of a threaded subtree, the nearest ancestor which isn't threaded, is called the threaded domain and serves as the resource domain for the whole subtree. This is the last cgroup where domain controllers are operational and where all the domain-level resource consumptions in the subtree are accounted. This allows threaded controllers to operate at thread granularity when requested while staying inside the scope of system-level resource distribution. As the root cgroup is exempt from the no-internal-process constraint, it can serve as both a threaded domain and a parent to normal cgroups, so, unlike non-root cgroups, the root cgroup can have both domain and threaded children. Internally, in a threaded subtree, each css_set has its ->dom_cset pointing to a matching css_set which belongs to the threaded domain. This ensures that thread root level cgroup_subsys_state for all threaded controllers are readily accessible for domain-level operations. This patch enables threaded mode for the pids and perf_events controllers. Neither has to worry about domain-level resource consumptions and it's enough to simply set the flag. For more details on the interface and behavior of the thread mode, please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added by this patch. v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create(). Spotted by Waiman. - Documentation updated as suggested by Waiman. - cgroup.type content slightly reformatted. - Mark the debug controller threaded. v4: - Updated to the general idea of marking specific cgroups domain/threaded as suggested by PeterZ. v3: - Dropped "join" and always make mixed children join the parent's threaded subtree. v2: - After discussions with Waiman, support for mixed thread mode is added. This should address the issue that Peter pointed out where any nesting should be avoided for thread subtrees while coexisting with other domain cgroups. - Enabling / disabling thread mode now piggy backs on the existing control mask update mechanism. - Bug fixes and cleanup. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21cgroup: implement CSS_TASK_ITER_THREADEDTejun Heo
cgroup v2 is in the process of growing thread granularity support. Once thread mode is enabled, the root cgroup of the subtree serves as the dom_cgrp to which the processes of the subtree conceptually belong and domain-level resource consumptions not tied to any specific task are charged. In the subtree, threads won't be subject to process granularity or no-internal-task constraint and can be distributed arbitrarily across the subtree. This patch implements a new task iterator flag CSS_TASK_ITER_THREADED, which, when used on a dom_cgrp, makes the iteration include the tasks on all the associated threaded css_sets. "cgroup.procs" read path is updated to use it so that reading the file on a proc_cgrp lists all processes. This will also be used by controller implementations which need to walk processes or tasks at the resource domain level. Task iteration is implemented nested in css_set iteration. If CSS_TASK_ITER_THREADED is specified, after walking tasks of each !threaded css_set, all the associated threaded css_sets are visited before moving onto the next !threaded css_set. v2: ->cur_pcset renamed to ->cur_dcset. Updated for the new enable-threaded-per-cgroup behavior. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-21cgroup: introduce cgroup->dom_cgrp and threaded css_set handlingTejun Heo
cgroup v2 is in the process of growing thread granularity support. A threaded subtree is composed of a thread root and threaded cgroups which are proper members of the subtree. The root cgroup of the subtree serves as the domain cgroup to which the processes (as opposed to threads / tasks) of the subtree conceptually belong and domain-level resource consumptions not tied to any specific task are charged. Inside the subtree, threads won't be subject to process granularity or no-internal-task constraint and can be distributed arbitrarily across the subtree. This patch introduces cgroup->dom_cgrp along with threaded css_set handling. * cgroup->dom_cgrp points to self for normal and thread roots. For proper thread subtree members, points to the dom_cgrp (the thread root). * css_set->dom_cset points to self if for normal and thread roots. If threaded, points to the css_set which belongs to the cgrp->dom_cgrp. The dom_cgrp serves as the resource domain and keeps the matching csses available. The dom_cset holds those csses and makes them easily accessible. * All threaded csets are linked on their dom_csets to enable iteration of all threaded tasks. * cgroup->nr_threaded_children keeps track of the number of threaded children. This patch adds the above but doesn't actually use them yet. The following patches will build on top. v4: ->nr_threaded_children added. v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset. Updated for the new enable-threaded-per-cgroup behavior. v2: Added cgroup_is_threaded() helper. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-21cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCSTejun Heo
css_task_iter currently always walks all tasks. With the scheduled cgroup v2 thread support, the iterator would need to handle multiple types of iteration. As a preparation, add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag is not specified, it walks all tasks as before. When asserted, the iterator only walks the group leaders. For now, the only user of the flag is cgroup v2 "cgroup.procs" file which no longer needs to skip non-leader tasks in cgroup_procs_next(). Note that cgroup v1 "cgroup.procs" can't use the group leader walk as v1 "cgroup.procs" doesn't mean "list all thread group leaders in the cgroup" but "list all thread group id's with any threads in the cgroup". While at it, update cgroup_procs_show() to use task_pid_vnr() instead of task_tgid_vnr(). As the iteration guarantees that the function only sees group leaders, this doesn't change the output and will allow sharing the function for thread iteration. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-21cgroup: reorganize cgroup.procs / task write pathTejun Heo
Currently, writes "cgroup.procs" and "cgroup.tasks" files are all handled by __cgroup_procs_write() on both v1 and v2. This patch reoragnizes the write path so that there are common helper functions that different write paths use. While this somewhat increases LOC, the different paths are no longer intertwined and each path has more flexibility to implement different behaviors which will be necessary for the planned v2 thread support. v3: - Restructured so that cgroup_procs_write_permission() takes @src_cgrp and @dst_cgrp. v2: - Rolled in Waiman's task reference count fix. - Updated on top of nsdelegate changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com>
2017-07-16cgroup: replace css_set walking populated test with testing ↵Tejun Heo
cgrp->nr_populated_csets Implement trivial cgroup_has_tasks() which tests whether cgrp->nr_populated_csets is zero and replace the explicit local populated test in cgroup_subtree_control(). This simplifies the code and cgroup_has_tasks() will be used in more places later. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-16cgroup: distinguish local and children populated statesTejun Heo
cgrp->populated_cnt counts both local (the cgroup's populated css_sets) and subtree proper (populated children) so that it's only zero when the whole subtree, including self, is empty. This patch splits the counter into two so that local and children populated states are tracked separately. It allows finer-grained tests on the state of the hierarchy which will be used to replace css_set walking local populated test. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-16cgroup: remove now unused list_head @pending in cgroup_apply_cftypes()Tejun Heo
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-15Merge branch 'work.mount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ->s_options removal from Al Viro: "Preparations for fsmount/fsopen stuff (coming next cycle). Everything gets moved to explicit ->show_options(), killing ->s_options off + some cosmetic bits around fs/namespace.c and friends. Basically, the stuff needed to work with fsmount series with minimum of conflicts with other work. It's not strictly required for this merge window, but it would reduce the PITA during the coming cycle, so it would be nice to have those bits and pieces out of the way" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: isofs: Fix isofs_show_options() VFS: Kill off s_options and helpers orangefs: Implement show_options 9p: Implement show_options isofs: Implement show_options afs: Implement show_options affs: Implement show_options befs: Implement show_options spufs: Implement show_options bpf: Implement show_options ramfs: Implement show_options pstore: Implement show_options omfs: Implement show_options hugetlbfs: Implement show_options VFS: Don't use save/replace_mount_options if not using generic_show_options VFS: Provide empty name qstr VFS: Make get_filesystem() return the affected filesystem VFS: Clean up whitespace in fs/namespace.c and fs/super.c Provide a function to create a NUL-terminated string from unterminated data
2017-07-14Merge tag 'pm-fixes-4.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix a recently exposed issue in the PCI device wakeup code and one older problem related to PCI device wakeup that has been reported recently, modify one more piece of computations in intel_pstate to get rid of a rounding error, fix a possible race in the schedutil cpufreq governor, fix the device PM QoS sysfs interface to correctly handle invalid user input, fix return values of two probe routines in devfreq drivers and constify an attribute_group structure in devfreq. Specifics: - Avoid clearing the PCI PME Enable bit for devices as a result of config space restoration which confuses AML executed afterward and causes wakeup events to be lost on some systems (Rafael Wysocki). - Fix the native PCIe PME interrupts handling in the cases when the PME IRQ is set up as a system wakeup one so that runtime PM remote wakeup works as expected after system resume on systems where that happens (Rafael Wysocki). - Fix the device PM QoS sysfs interface to handle invalid user input correctly instead of using an unititialized variable value as the latency tolerance for the device at hand (Dan Carpenter). - Get rid of one more rounding error from intel_pstate computations (Srinivas Pandruvada). - Fix the schedutil cpufreq governor to prevent it from possibly accessing unititialized data structures from governor callbacks in some cases on systems when multiple CPUs share a single cpufreq policy object (Vikram Mulukutla). - Fix the return values of probe routines in two devfreq drivers (Gustavo Silva). - Constify an attribute_group structure in devfreq (Arvind Yadav)" * tag 'pm-fixes-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PCI / PM: Fix native PME handling during system suspend/resume PCI / PM: Restore PME Enable after config space restoration cpufreq: schedutil: Fix sugov_start() versus sugov_update_shared() race PM / QoS: return -EINVAL for bogus strings cpufreq: intel_pstate: Fix ratio setting for min_perf_pct PM / devfreq: constify attribute_group structures. PM / devfreq: tegra: fix error return code in tegra_devfreq_probe() PM / devfreq: rk3399_dmc: fix error return code in rk3399_dmcfreq_probe()
2017-07-14kmod: throttle kmod thread limitLuis R. Rodriguez
If we reach the limit of modprobe_limit threads running the next request_module() call will fail. The original reason for adding a kill was to do away with possible issues with in old circumstances which would create a recursive series of request_module() calls. We can do better than just be super aggressive and reject calls once we've reached the limit by simply making pending callers wait until the threshold has been reduced, and then throttling them in, one by one. This throttling enables requests over the kmod concurrent limit to be processed once a pending request completes. Only the first item queued up to wait is woken up. The assumption here is once a task is woken it will have no other option to also kick the queue to check if there are more pending tasks -- regardless of whether or not it was successful. By throttling and processing only max kmod concurrent tasks we ensure we avoid unexpected fatal request_module() calls, and we keep memory consumption on module loading to a minimum. With x86_64 qemu, with 4 cores, 4 GiB of RAM it takes the following run time to run both tests: time ./kmod.sh -t 0008 real 0m16.366s user 0m0.883s sys 0m8.916s time ./kmod.sh -t 0009 real 0m50.803s user 0m0.791s sys 0m9.852s Link: http://lkml.kernel.org/r/20170628223155.26472-4-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Jessica Yu <jeyu@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michal Marek <mmarek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14kernel/watchdog.c: use better pr_fmt prefixKefeng Wang
After commit 73ce0511c436 ("kernel/watchdog.c: move hardlockup detector to separate file"), 'NMI watchdog' is inappropriate in kernel/watchdog.c, using 'watchdog' only. Link: http://lkml.kernel.org/r/1499928642-48983-1-git-send-email-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Babu Moger <babu.moger@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14Merge branches 'pm-cpufreq-sched' and 'intel_pstate'Rafael J. Wysocki
* pm-cpufreq-sched: cpufreq: schedutil: Fix sugov_start() versus sugov_update_shared() race * intel_pstate: cpufreq: intel_pstate: Fix ratio setting for min_perf_pct
2017-07-13Merge tag 'trace-v4.13-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull more tracing updates from Steven Rostedt: "A few more minor updates: - Show the tgid mappings for user space trace tools to use - Fix and optimize the comm and tgid cache recording - Sanitize derived kprobe names - Ftrace selftest updates - trace file header fix - Update of Documentation/trace/ftrace.txt - Compiler warning fixes - Fix possible uninitialized variable" * tag 'trace-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Fix uninitialized variable in match_records() ftrace: Remove an unneeded NULL check ftrace: Hide cached module code for !CONFIG_MODULES tracing: Do note expose stack_trace_filter without DYNAMIC_FTRACE tracing: Update Documentation/trace/ftrace.txt tracing: Fixup trace file header alignment selftests/ftrace: Add a testcase for kprobe event naming selftests/ftrace: Add a test to probe module functions selftests/ftrace: Update multiple kprobes test for powerpc trace/kprobes: Sanitize derived event names tracing: Attempt to record other information even if some fail tracing: Treat recording tgid for idle task as a success tracing: Treat recording comm for idle task as a success tracing: Add saved_tgids file to show cached pid to tgid mappings
2017-07-13Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge yet more updates from Andrew Morton: - various misc things - kexec updates - sysctl core updates - scripts/gdb udpates - checkpoint-restart updates - ipc updates - kernel/watchdog updates - Kees's "rough equivalent to the glibc _FORTIFY_SOURCE=1 feature" - "stackprotector: ascii armor the stack canary" - more MM bits - checkpatch updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits) writeback: rework wb_[dec|inc]_stat family of functions ARM: samsung: usb-ohci: move inline before return type video: fbdev: omap: move inline before return type video: fbdev: intelfb: move inline before return type USB: serial: safe_serial: move __inline__ before return type drivers: tty: serial: move inline before return type drivers: s390: move static and inline before return type x86/efi: move asmlinkage before return type sh: move inline before return type MIPS: SMP: move asmlinkage before return type m68k: coldfire: move inline before return type ia64: sn: pci: move inline before type ia64: move inline before return type FRV: tlbflush: move asmlinkage before return type CRIS: gpio: move inline before return type ARM: HP Jornada 7XX: move inline before return type ARM: KVM: move asmlinkage before type checkpatch: improve the STORAGE_CLASS test mm, migration: do not trigger OOM killer when migrating memory drm/i915: use __GFP_RETRY_MAYFAIL ...
2017-07-12Merge tag 'modules-for-v4.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules updates from Jessica Yu: "Summary of modules changes for the 4.13 merge window: - Minor code cleanups - Avoid accessing mod struct prior to checking module struct version, from Kees - Fix racy atomic inc/dec logic of kmod_concurrent_max in kmod, from Luis" * tag 'modules-for-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: make the modinfo name const kmod: reduce atomic operations on kmod_concurrent and simplify module: use list_for_each_entry_rcu() on find_module_all() kernel/module.c: suppress warning about unused nowarn variable module: Add module name to modinfo module: Pass struct load_info into symbol checks
2017-07-12fork,random: use get_random_canary() to set tsk->stack_canaryRik van Riel
Use the ascii-armor canary to prevent unterminated C string overflows from being able to successfully overwrite the canary, even if they somehow obtain the canary value. Inspired by execshield ascii-armor and Daniel Micay's linux-hardened tree. Link: http://lkml.kernel.org/r/20170524155751.424-3-riel@redhat.com Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Daniel Micay <danielmicay@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12kexec_file: adjust declaration of kexec_purgatoryKees Cook
Defining kexec_purgatory as a zero-length char array upsets compile time size checking. Since this is built on a per-arch basis, define it as an unsized char array (like is done for other similar things, e.g. linker sections). This silences the warning generated by the future CONFIG_FORTIFY_SOURCE, which did not like the memcmp() of a "0 byte" array. This drops the __weak and uses an extern instead, since both users define kexec_purgatory. Link: http://lkml.kernel.org/r/1497903987-21002-4-git-send-email-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Micay <danielmicay@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12kernel/watchdog: provide watchdog_nmi_reconfigure() for arch watchdogsNicholas Piggin
After reconfiguring watchdog sysctls etc., architecture specific watchdogs may not get all their parameters updated. watchdog_nmi_reconfigure() can be implemented to pull the new values in and set the arch NMI watchdog. [npiggin@gmail.com: add code comments] Link: http://lkml.kernel.org/r/20170617125933.774d3858@roar.ozlabs.ibm.com [arnd@arndb.de: hide unused function] Link: http://lkml.kernel.org/r/20170620204854.966601-1-arnd@arndb.de Link: http://lkml.kernel.org/r/20170616065715.18390-5-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Don Zickus <dzickus@redhat.com> Tested-by: Babu Moger <babu.moger@oracle.com> [sparc] Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12kernel/watchdog: split up config optionsNicholas Piggin
Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR. LOCKUP_DETECTOR implies the general boot, sysctl, and programming interfaces for the lockup detectors. An architecture that wants to use a hard lockup detector must define HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH. Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and does not implement the LOCKUP_DETECTOR interfaces. sparc is unusual in that it has started to implement some of the interfaces, but not fully yet. It should probably be converted to a full HAVE_HARDLOCKUP_DETECTOR_ARCH. [npiggin@gmail.com: fix] Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Don Zickus <dzickus@redhat.com> Reviewed-by: Babu Moger <babu.moger@oracle.com> Tested-by: Babu Moger <babu.moger@oracle.com> [sparc] Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12kernel/watchdog: introduce arch_touch_nmi_watchdog()Nicholas Piggin
For architectures that define HAVE_NMI_WATCHDOG, instead of having them provide the complete touch_nmi_watchdog() function, just have them provide arch_touch_nmi_watchdog(). This gives the generic code more flexibility in implementing this function, and arch implementations don't miss out on touching the softlockup watchdog or other generic details. Link: http://lkml.kernel.org/r/20170616065715.18390-3-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Don Zickus <dzickus@redhat.com> Reviewed-by: Babu Moger <babu.moger@oracle.com> Tested-by: Babu Moger <babu.moger@oracle.com> [sparc] Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12fault-inject: support systematic fault injectionDmitry Vyukov
Add /proc/self/task/<current-tid>/fail-nth file that allows failing 0-th, 1-st, 2-nd and so on calls systematically. Excerpt from the added documentation: "Write to this file of integer N makes N-th call in the current task fail (N is 0-based). Read from this file returns a single char 'Y' or 'N' that says if the fault setup with a previous write to this file was injected or not, and disables the fault if it wasn't yet injected. Note that this file enables all types of faults (slab, futex, etc). This setting takes precedence over all other generic settings like probability, interval, times, etc. But per-capability settings (e.g. fail_futex/ignore-private) take precedence over it. This feature is intended for systematic testing of faults in a single system call. See an example below" Why add a new setting: 1. Existing settings are global rather than per-task. So parallel testing is not possible. 2. attr->interval is close but it depends on attr->count which is non reset to 0, so interval does not work as expected. 3. Trying to model this with existing settings requires manipulations of all of probability, interval, times, space, task-filter and unexposed count and per-task make-it-fail files. 4. Existing settings are per-failure-type, and the set of failure types is potentially expanding. 5. make-it-fail can't be changed by unprivileged user and aggressive stress testing better be done from an unprivileged user. Similarly, this would require opening the debugfs files to the unprivileged user, as he would need to reopen at least times file (not possible to pre-open before dropping privs). The proposed interface solves all of the above (see the example). We want to integrate this into syzkaller fuzzer. A prototype has found 10 bugs in kernel in first day of usage: https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance I've made the current interface work with all types of our sandboxes. For setuid the secret sauce was prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) to make /proc entries non-root owned. So I am fine with the current version of the code. [akpm@linux-foundation.org: fix build] Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12kcmp: add KCMP_EPOLL_TFD mode to compare epoll target filesCyrill Gorcunov
With current epoll architecture target files are addressed with file_struct and file descriptor number, where the last is not unique. Moreover files can be transferred from another process via unix socket, added into queue and closed then so we won't find this descriptor in the task fdinfo list. Thus to checkpoint and restore such processes CRIU needs to find out where exactly the target file is present to add it into epoll queue. For this sake one can use kcmp call where some particular target file from the queue is compared with arbitrary file passed as an argument. Because epoll target files can have same file descriptor number but different file_struct a caller should explicitly specify the offset within. To test if some particular file is matching entry inside epoll one have to - fill kcmp_epoll_slot structure with epoll file descriptor, target file number and target file offset (in case if only one target is present then it should be 0) - call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &kcmp_epoll_slot) - the kernel fetch file pointer matching file descriptor @fd of pid1 - lookups for file struct in epoll queue of pid2 and returns traditional 0,1,2 result for sorting purpose Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Andrey Vagin <avagin@openvz.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12kernel/sysctl_binary.c: check name array length in deprecated_sysctl_warning()Mateusz Jurczyk
Prevent use of uninitialized memory (originating from the stack frame of do_sysctl()) by verifying that the name array is filled with sufficient input data before comparing its specific entries with integer constants. Through timing measurement or analyzing the kernel debug logs, a user-mode program could potentially infer the results of comparisons against the uninitialized memory, and acquire some (very limited) information about the state of the kernel stack. The change also eliminates possible future warnings by tools such as KMSAN and other code checkers / instrumentations. Link: http://lkml.kernel.org/r/20170524122139.21333-1-mjurczyk@google.com Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Matthew Whitehead <tedheadster@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12sysctl: add unsigned int range supportLuis R. Rodriguez
To keep parity with regular int interfaces provide the an unsigned int proc_douintvec_minmax() which allows you to specify a range of allowed valid numbers. Adding proc_douintvec_minmax_sysadmin() is easy but we can wait for an actual user for that. Link: http://lkml.kernel.org/r/20170519033554.18592-6-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Kees Cook <keescook@chromium.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12sysctl: simplify unsigned int supportLuis R. Rodriguez
Commit e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields") added proc_douintvec() to start help adding support for unsigned int, this however was only half the work needed. Two fixes have come in since then for the following issues: o Printing the values shows a negative value, this happens since do_proc_dointvec() and this uses proc_put_long() This was fixed by commit 5380e5644afbba9 ("sysctl: don't print negative flag for proc_douintvec"). o We can easily wrap around the int values: UINT_MAX is 4294967295, if we echo in 4294967295 + 1 we end up with 0, using 4294967295 + 2 we end up with 1. o We echo negative values in and they are accepted This was fixed by commit 425fffd886ba ("sysctl: report EINVAL if value is larger than UINT_MAX for proc_douintvec"). It still also failed to be added to sysctl_check_table()... instead of adding it with the current implementation just provide a proper and simplified unsigned int support without any array unsigned int support with no negative support at all. Historically sysctl proc helpers have supported arrays, due to the complexity this adds though we've taken a step back to evaluate array users to determine if its worth upkeeping for unsigned int. An evaluation using Coccinelle has been done to perform a grammatical search to ask ourselves: o How many sysctl proc_dointvec() (int) users exist which likely should be moved over to proc_douintvec() (unsigned int) ? Answer: about 8 - Of these how many are array users ? Answer: Probably only 1 o How many sysctl array users exist ? Answer: about 12 This last question gives us an idea just how popular arrays: they are not. Array support should probably just be kept for strings. The identified uint ports are: drivers/infiniband/core/ucma.c - max_backlog drivers/infiniband/core/iwcm.c - default_backlog net/core/sysctl_net_core.c - rps_sock_flow_sysctl() net/netfilter/nf_conntrack_timestamp.c - nf_conntrack_timestamp -- bool net/netfilter/nf_conntrack_acct.c nf_conntrack_acct -- bool net/netfilter/nf_conntrack_ecache.c - nf_conntrack_events -- bool net/netfilter/nf_conntrack_helper.c - nf_conntrack_helper -- bool net/phonet/sysctl.c proc_local_port_range() The only possible array users is proc_local_port_range() but it does not seem worth it to add array support just for this given the range support works just as well. Unsigned int support should be desirable more for when you *need* more than INT_MAX or using int min/max support then does not suffice for your ranges. If you forget and by mistake happen to register an unsigned int proc entry with an array, the driver will fail and you will get something as follows: sysctl table check failed: debug/test_sysctl//uint_0002 array now allowed CPU: 2 PID: 1342 Comm: modprobe Tainted: G W E <etc> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS <etc> Call Trace: dump_stack+0x63/0x81 __register_sysctl_table+0x350/0x650 ? kmem_cache_alloc_trace+0x107/0x240 __register_sysctl_paths+0x1b3/0x1e0 ? 0xffffffffc005f000 register_sysctl_table+0x1f/0x30 test_sysctl_init+0x10/0x1000 [test_sysctl] do_one_initcall+0x52/0x1a0 ? kmem_cache_alloc_trace+0x107/0x240 do_init_module+0x5f/0x200 load_module+0x1867/0x1bd0 ? __symbol_put+0x60/0x60 SYSC_finit_module+0xdf/0x110 SyS_finit_module+0xe/0x10 entry_SYSCALL_64_fastpath+0x1e/0xad RIP: 0033:0x7f042b22d119 <etc> Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields") Link: http://lkml.kernel.org/r/20170519033554.18592-5-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Suggested-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Cc: Liping Zhang <zlpnobody@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Kees Cook <keescook@chromium.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12sysctl: fold sysctl_writes_strict checks into helperLuis R. Rodriguez
The mode sysctl_writes_strict positional checks keep being copy and pasted as we add new proc handlers. Just add a helper to avoid code duplication. Link: http://lkml.kernel.org/r/20170519033554.18592-4-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Suggested-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12sysctl: kdoc'ify sysctl_writes_strictLuis R. Rodriguez
Document the different sysctl_writes_strict modes in code. Link: http://lkml.kernel.org/r/20170519033554.18592-3-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12kdump: protect vmcoreinfo data under the crash memoryXunlei Pang
Currently vmcoreinfo data is updated at boot time subsys_initcall(), it has the risk of being modified by some wrong code during system is running. As a result, vmcore dumped may contain the wrong vmcoreinfo. Later on, when using "crash", "makedumpfile", etc utility to parse this vmcore, we probably will get "Segmentation fault" or other unexpected errors. E.g. 1) wrong code overwrites vmcoreinfo_data; 2) further crashes the system; 3) trigger kdump, then we obviously will fail to recognize the crash context correctly due to the corrupted vmcoreinfo. Now except for vmcoreinfo, all the crash data is well protected(including the cpu note which is fully updated in the crash path, thus its correctness is guaranteed). Given that vmcoreinfo data is a large chunk prepared for kdump, we better protect it as well. To solve this, we relocate and copy vmcoreinfo_data to the crash memory when kdump is loading via kexec syscalls. Because the whole crash memory will be protected by existing arch_kexec_protect_crashkres() mechanism, we naturally protect vmcoreinfo_data from write(even read) access under kernel direct mapping after kdump is loaded. Since kdump is usually loaded at the very early stage after boot, we can trust the correctness of the vmcoreinfo data copied. On the other hand, we still need to operate the vmcoreinfo safe copy when crash happens to generate vmcoreinfo_note again, we rely on vmap() to map out a new kernel virtual address and update to use this new one instead in the following crash_save_vmcoreinfo(). BTW, we do not touch vmcoreinfo_note, because it will be fully updated using the protected vmcoreinfo_data after crash which is surely correct just like the cpu crash note. Link: http://lkml.kernel.org/r/1493281021-20737-3-git-send-email-xlpang@redhat.com Signed-off-by: Xunlei Pang <xlpang@redhat.com> Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Juergen Gross <jgross@suse.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12powerpc/fadump: use the correct VMCOREINFO_NOTE_SIZE for phdrXunlei Pang
vmcoreinfo_max_size stands for the vmcoreinfo_data, the correct one we should use is vmcoreinfo_note whose total size is VMCOREINFO_NOTE_SIZE. Like explained in commit 77019967f06b ("kdump: fix exported size of vmcoreinfo note"), it should not affect the actual function, but we better fix it, also this change should be safe and backward compatible. After this, we can get rid of variable vmcoreinfo_max_size, let's use the corresponding macros directly, fewer variables means more safety for vmcoreinfo operation. [xlpang@redhat.com: fix build warning] Link: http://lkml.kernel.org/r/1494830606-27736-1-git-send-email-xlpang@redhat.com Link: http://lkml.kernel.org/r/1493281021-20737-2-git-send-email-xlpang@redhat.com Signed-off-by: Xunlei Pang <xlpang@redhat.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Reviewed-by: Dave Young <dyoung@redhat.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Juergen Gross <jgross@suse.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12kexec: move vmcoreinfo out of the kernel's .bss sectionXunlei Pang
As Eric said, "what we need to do is move the variable vmcoreinfo_note out of the kernel's .bss section. And modify the code to regenerate and keep this information in something like the control page. Definitely something like this needs a page all to itself, and ideally far away from any other kernel data structures. I clearly was not watching closely the data someone decided to keep this silly thing in the kernel's .bss section." This patch allocates extra pages for these vmcoreinfo_XXX variables, one advantage is that it enhances some safety of vmcoreinfo, because vmcoreinfo now is kept far away from other kernel data structures. Link: http://lkml.kernel.org/r/1493281021-20737-1-git-send-email-xlpang@redhat.com Signed-off-by: Xunlei Pang <xlpang@redhat.com> Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Reviewed-by: Juergen Gross <jgross@suse.com> Suggested-by: Eric Biederman <ebiederm@xmission.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Young <dyoung@redhat.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12kernel/fork.c: virtually mapped stacks: do not disable interruptsChristoph Lameter
The reason to disable interrupts seems to be to avoid switching to a different processor while handling per cpu data using individual loads and stores. If we use per cpu RMV primitives we will not have to disable interrupts. Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705171055130.5898@east.gentwo.org Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12fix a braino in compat_sys_getrlimit()Al Viro
Reported-and-tested-by: Meelis Roos <mroos@linux.ee> Fixes: commit d9e968cb9f84 "getrlimit()/setrlimit(): move compat to native" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ftrace: Fix uninitialized variable in match_records()Dan Carpenter
My static checker complains that if "func" is NULL then "clear_filter" is uninitialized. This seems like it could be true, although it's possible something subtle is happening that I haven't seen. kernel/trace/ftrace.c:3844 match_records() error: uninitialized symbol 'clear_filter'. Link: http://lkml.kernel.org/r/20170712073556.h6tkpjcdzjaozozs@mwanda Cc: stable@vger.kernel.org Fixes: f0a3b154bd7 ("ftrace: Clarify code for mod command") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-07-12ftrace: Remove an unneeded NULL checkDan Carpenter
"func" can't be NULL and it doesn't make sense to check because we've already derefenced it. Link: http://lkml.kernel.org/r/20170712073340.4enzeojeoupuds5a@mwanda Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-07-12cpufreq: schedutil: Fix sugov_start() versus sugov_update_shared() raceVikram Mulukutla
With a shared policy in place, when one of the CPUs in the policy is hotplugged out and then brought back online, sugov_stop() and sugov_start() are called in order. sugov_stop() removes utilization hooks for each CPU in the policy and does nothing else in the for_each_cpu() loop. sugov_start() on the other hand iterates through the CPUs in the policy and re-initializes the per-cpu structure _and_ adds the utilization hook. This implies that the scheduler is allowed to invoke a CPU's utilization update hook when the rest of the per-cpu structures have yet to be re-inited. Apart from some strange values in tracepoints this doesn't cause a problem, but if we do end up accessing a pointer from the per-cpu sugov_cpu structure somewhere in the sugov_update_shared() path, we will likely see crashes since the memset for another CPU in the policy is free to race with sugov_update_shared from the CPU that is ready to go. So let's fix this now to first init all per-cpu structures, and then add the per-cpu utilization update hooks all at once. Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-11ftrace: Hide cached module code for !CONFIG_MODULESArnd Bergmann
When modules are disabled, we get a harmless build warning: kernel/trace/ftrace.c:4051:13: error: 'process_cached_mods' defined but not used [-Werror=unused-function] This adds the same #ifdef around the new code that exists around its caller. Link: http://lkml.kernel.org/r/20170710084413.1820568-1-arnd@arndb.de Fixes: d7fbf8df7ca0 ("ftrace: Implement cached modules tracing on module load") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-07-11tracing: Do note expose stack_trace_filter without DYNAMIC_FTRACESteven Rostedt (VMware)
The "stack_trace_filter" file only makes sense if DYNAMIC_FTRACE is configured in. If it is not, then the user can not filter any functions. Not only that, the open function causes warnings when DYNAMIC_FTRACE is not set. Link: http://lkml.kernel.org/r/20170710110521.600806-1-arnd@arndb.de Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-07-11tracing: Fixup trace file header alignmentSteven Rostedt (VMware)
The addition of TGID to the tracing header added a check to see if TGID shoudl be displayed or not, and updated the header accordingly. Unfortunately, it broke the default header. Also add constant strings to use for spacing. This does remove the visibility of the header a bit, but cuts it down from the extended lines much greater than 80 characters. Before this change: # tracer: function # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU#|||| TIMESTAMP FUNCTION # | | | |||| | | swapper/0-1 [000] .... 0.277830: migration_init <-do_one_initcall swapper/0-1 [002] d... 13.861967: Unknown type 1201 swapper/0-1 [002] d..1 13.861970: Unknown type 1202 After this change: # tracer: function # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | swapper/0-1 [000] .... 0.278245: migration_init <-do_one_initcall swapper/0-1 [003] d... 13.861189: Unknown type 1201 swapper/0-1 [003] d..1 13.861192: Unknown type 1202 Cc: Joel Fernandes <joelaf@google.com> Fixes: 441dae8f2f29 ("tracing: Add support for display of tgid in trace output") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-07-10Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: - most of the rest of MM - KASAN updates - lib/ updates - checkpatch updates - some binfmt_elf changes - various misc bits * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (115 commits) kernel/exit.c: avoid undefined behaviour when calling wait4() kernel/signal.c: avoid undefined behaviour in kill_something_info binfmt_elf: safely increment argv pointers s390: reduce ELF_ET_DYN_BASE powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB arm64: move ELF_ET_DYN_BASE to 4GB / 4MB arm: move ELF_ET_DYN_BASE to 4MB binfmt_elf: use ELF_ET_DYN_BASE only for PIE fs, epoll: short circuit fetching events if thread has been killed checkpatch: improve multi-line alignment test checkpatch: improve macro reuse test checkpatch: change format of --color argument to --color[=WHEN] checkpatch: silence perl 5.26.0 unescaped left brace warnings checkpatch: improve tests for multiple line function definitions checkpatch: remove false warning for commit reference checkpatch: fix stepping through statements with $stat and ctx_statement_block checkpatch: [HLP]LIST_HEAD is also declaration checkpatch: warn when a MAINTAINERS entry isn't [A-Z]:\t checkpatch: improve the unnecessary OOM message test lib/bsearch.c: micro-optimize pivot position calculation ...
2017-07-10kernel/exit.c: avoid undefined behaviour when calling wait4()zhongjiang
wait4(-2147483648, 0x20, 0, 0xdd0000) triggers: UBSAN: Undefined behaviour in kernel/exit.c:1651:9 The related calltrace is as follows: negation of -2147483648 cannot be represented in type 'int': CPU: 9 PID: 16482 Comm: zj Tainted: G B ---- ------- 3.10.0-327.53.58.71.x86_64+ #66 Hardware name: Huawei Technologies Co., Ltd. Tecal RH2285 /BC11BTSA , BIOS CTSAV036 04/27/2011 Call Trace: dump_stack+0x19/0x1b ubsan_epilogue+0xd/0x50 __ubsan_handle_negate_overflow+0x109/0x14e SyS_wait4+0x1cb/0x1e0 system_call_fastpath+0x16/0x1b Exclude the overflow to avoid the UBSAN warning. Link: http://lkml.kernel.org/r/1497264618-20212-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhongjiang <zhongjiang@huawei.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10kernel/signal.c: avoid undefined behaviour in kill_something_infozhongjiang
When running kill(72057458746458112, 0) in userspace I hit the following issue. UBSAN: Undefined behaviour in kernel/signal.c:1462:11 negation of -2147483648 cannot be represented in type 'int': CPU: 226 PID: 9849 Comm: test Tainted: G B ---- ------- 3.10.0-327.53.58.70.x86_64_ubsan+ #116 Hardware name: Huawei Technologies Co., Ltd. RH8100 V3/BC61PBIA, BIOS BLHSV028 11/11/2014 Call Trace: dump_stack+0x19/0x1b ubsan_epilogue+0xd/0x50 __ubsan_handle_negate_overflow+0x109/0x14e SYSC_kill+0x43e/0x4d0 SyS_kill+0xe/0x10 system_call_fastpath+0x16/0x1b Add code to avoid the UBSAN detection. [akpm@linux-foundation.org: tweak comment] Link: http://lkml.kernel.org/r/1496670008-59084-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhongjiang <zhongjiang@huawei.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10lib/extable.c: use bsearch() library function in search_extable()Thomas Meyer
[thomas@m3y3r.de: v3: fix arch specific implementations] Link: http://lkml.kernel.org/r/1497890858.12931.7.camel@m3y3r.de Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>