summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2017-07-24rcutorture: Print SRCU lock/unlock totalsPaul E. McKenney
This commit adds printing of SRCU lock/unlock totals, which are just the sums of the per-CPU counts. Saves a bit of mental arithmetic. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-07-24rcutorture: Move SRCU status printing to SRCU implementationsPaul E. McKenney
This commit gets rid of some ugly #ifdefs in rcutorture.c by moving the SRCU status printing to the SRCU implementations. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-07-24srcu: Make process_srcu() be staticPaul E. McKenney
The function process_srcu() is not invoked outside of srcutree.c, so this commit makes it static and drops the EXPORT_SYMBOL_GPL(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-07-24srcu: Move rcu_scheduler_starting() from Tiny RCU to Tiny SRCUPaul E. McKenney
Other than lockdep support, Tiny RCU has no need for the scheduler status. However, Tiny SRCU will need this to control boot-time behavior independent of lockdep. Therefore, this commit moves rcu_scheduler_starting() from kernel/rcu/tiny_plugin.h to kernel/rcu/srcutiny.c. This in turn allows the complete removal of kernel/rcu/tiny_plugin.h. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-07-24PM / suspend: Define pr_fmt() in suspend.cRafael J. Wysocki
Define a common prefix ("PM:") for messages printed by the code in kernel/power/suspend.c. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
2017-07-24PM / suspend: Use mem_sleep_labels[] strings in messagesRafael J. Wysocki
Some messages in suspend.c currently print state names from pm_states[], but that may be confusing if the mem_sleep sysfs attribute is changed to anything different from "mem", because in those cases the messages will say either "freeze" or "standby" after writing "mem" to /sys/power/state. To avoid the confusion, use mem_sleep_labels[] strings in those messages instead. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
2017-07-24PM / sleep: Put pm_test under CONFIG_PM_SLEEP_DEBUGRafael J. Wysocki
The pm_test sysfs attribute is under CONFIG_PM_DEBUG, but it doesn't make sense to provide it if CONFIG_PM_SLEEP is unset, so put it under CONFIG_PM_SLEEP_DEBUG instead. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-24PM / sleep: Check pm_wakeup_pending() in __device_suspend_noirq()Rafael J. Wysocki
Restore the pm_wakeup_pending() check in __device_suspend_noirq() removed by commit eed4d47efe95 (ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle) as that allows the function to return earlier if there's a wakeup event pending already (so that it may spend less time on carrying out operations that will be reversed shortly anyway) and rework the main suspend-to-idle loop to take that optimization into account. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-24PM / s2idle: Rearrange the main suspend-to-idle loopRafael J. Wysocki
As a preparation for subsequent changes, rearrange the core suspend-to-idle code by moving the initial invocation of dpm_suspend_noirq() into s2idle_loop(). This also causes debug messages from that code to appear in a less confusing order. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-24bpf/verifier: fix min/max handling in BPF_SUBEdward Cree
We have to subtract the src max from the dst min, and vice-versa, since (e.g.) the smallest result comes from the largest subtrahend. Fixes: 484611357c19 ("bpf: allow access into map value arrays") Signed-off-by: Edward Cree <ecree@solarflare.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24signal: Fix sending signals with siginfoEric W. Biederman
Today sending a signal with rt_sigqueueinfo and receving it on a signalfd does not work reliably. The issue is that reading a signalfd instead of returning a siginfo returns a signalfd_siginfo and the kernel must convert from one to the other. The kernel does not currently have the code to deduce which union members of struct siginfo are in use. In this patchset I fix that by introducing a new function siginfo_layout that can look at a siginfo and report which union member of struct siginfo is in use. Before that I clean up how we populate struct siginfo. The siginfo structure has two key members si_signo and si_code. Some si_codes are signal specific and for those it takes si_signo and si_code to indicate the members of siginfo that are valid. The rest of the si_code values are signal independent like SI_USER, SI_KERNEL, SI_QUEUE, and SI_TIMER and only si_code is needed to indicate which members of siginfo are valid. At least that is how POSIX documents them, and how common sense would indicate they should function. In practice we have been rather sloppy about maintaining the ABI in linux and we have some exceptions. We have a couple of buggy architectures that make SI_USER mean something different when combined with SIGFPE or SIGTRAP. Worse we have fcntl(F_SETSIG) which results in the si_codes POLL_IN, POLL_OUT, POLL_MSG, POLL_ERR, POLL_PRI, POLL_HUP being sent with any arbitrary signal, while the values are in a range that overlaps the signal specific si_codes. Thankfully the ambiguous cases with the POLL_NNN si_codes are for things no sane persion would do that so we can rectify the situtation. AKA no one cares so we won't cause a regression fixing it. As part of fixing this I stop leaking the __SI_xxxx codes to userspace and stop storing them in the high 16bits of si_code. Making the kernel code fundamentally simpler. We have already confirmed that the one application that would see this difference in kernel behavior CRIU won't be affected by this change as it copies values verbatim from one kernel interface to another. v3: - Corrected the patches so they bisect properly v2: - Benchmarked the code to confirm no performance changes are visible. - Reworked the first couple of patches so that TRAP_FIXME and FPE_FIXME are not exported to userspace. - Rebased on top of the siginfo cleanup that came in v4.13-rc1 - Updated alpha to use both TRAP_FIXME and FPE_FIXME Eric W. Biederman (7): signal/alpha: Document a conflict with SI_USER for SIGTRAP signal/ia64: Document a conflict with SI_USER with SIGFPE signal/sparc: Document a conflict with SI_USER with SIGFPE signal/mips: Document a conflict with SI_USER with SIGFPE signal/testing: Don't look for __SI_FAULT in userspace fcntl: Don't use ambiguous SIG_POLL si_codes signal: Remove kernel interal si_code magic Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-07-24signal: Remove kernel interal si_code magicEric W. Biederman
struct siginfo is a union and the kernel since 2.4 has been hiding a union tag in the high 16bits of si_code using the values: __SI_KILL __SI_TIMER __SI_POLL __SI_FAULT __SI_CHLD __SI_RT __SI_MESGQ __SI_SYS While this looks plausible on the surface, in practice this situation has not worked well. - Injected positive signals are not copied to user space properly unless they have these magic high bits set. - Injected positive signals are not reported properly by signalfd unless they have these magic high bits set. - These kernel internal values leaked to userspace via ptrace_peek_siginfo - It was possible to inject these kernel internal values and cause the the kernel to misbehave. - Kernel developers got confused and expected these kernel internal values in userspace in kernel self tests. - Kernel developers got confused and set si_code to __SI_FAULT which is SI_USER in userspace which causes userspace to think an ordinary user sent the signal and that it was not kernel generated. - The values make it impossible to reorganize the code to transform siginfo_copy_to_user into a plain copy_to_user. As si_code must be massaged before being passed to userspace. So remove these kernel internal si codes and make the kernel code simpler and more maintainable. To replace these kernel internal magic si_codes introduce the helper function siginfo_layout, that takes a signal number and an si_code and computes which union member of siginfo is being used. Have siginfo_layout return an enumeration so that gcc will have enough information to warn if a switch statement does not handle all of union members. A couple of architectures have a messed up ABI that defines signal specific duplications of SI_USER which causes more special cases in siginfo_layout than I would like. The good news is only problem architectures pay the cost. Update all of the code that used the previous magic __SI_ values to use the new SIL_ values and to call siginfo_layout to get those values. Escept where not all of the cases are handled remove the defaults in the switch statements so that if a new case is missed in the future the lack will show up at compile time. Modify the code that copies siginfo si_code to userspace to just copy the value and not cast si_code to a short first. The high bits are no longer used to hold a magic union member. Fixup the siginfo header files to stop including the __SI_ values in their constants and for the headers that were missing it to properly update the number of si_codes for each signal type. The fixes to copy_siginfo_from_user32 implementations has the interesting property that several of them perviously should never have worked as the __SI_ values they depended up where kernel internal. With that dependency gone those implementations should work much better. The idea of not passing the __SI_ values out to userspace and then not reinserting them has been tested with criu and criu worked without changes. Ref: 2.4.0-test1 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-07-23cgroup: fix error return value from cgroup_subtree_control()Tejun Heo
While refactoring, f7b2814bb9b6 ("cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()") broke error return value from the function. The return value from the last operation is always overridden to zero. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v4.6+ Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-23PM / timekeeping: Print debug messages when requestedRafael J. Wysocki
The messages printed by tk_debug_account_sleep_time() are basically useful for system sleep debugging, so print them only when the other debug messages from the core suspend/hibernate code are enabled. While at it, make it clear that the messages from tk_debug_account_sleep_time() are about timekeeping suspend duration, because in general timekeeping may be suspeded and resumed for multiple times during one system suspend-resume cycle. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-22PM / sleep: Mark suspend/hibernation start and finishRafael J. Wysocki
Regardless of whether or not debug messages from the core system suspend/hibernation code are enabled, it is useful to know when system-wide transitions start and finish (or fail), so print "info" messages at these points. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Mark Salyzyn <salyzyn@android.com>
2017-07-22PM / sleep: Do not print debug messages by defaultRafael J. Wysocki
Debug messages from the system suspend/hibernation infrastructure can fill up the entire kernel log buffer in some cases and anyway they are only useful for debugging. They depend on CONFIG_PM_DEBUG, but that is set as a rule as some generally useful diagnostic facilities depend on it too. For this reason, avoid printing those messages by default, but make it possible to turn them on as needed with the help of a new sysfs attribute under /sys/power/. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-22PM / suspend: Export pm_suspend_target_stateFlorian Fainelli
Have the core suspend/resume framework store the system-wide suspend state (suspend_state_t) we are about to enter, and expose it to drivers via pm_suspend_target_state in order to retrieve that. The state is assigned in suspend_devices_and_enter(). This is useful for platform specific drivers that may need to take a slightly different suspend/resume path based on the system's suspend/resume state being entered. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-22cpufreq: Use transition_delay_us for legacy governors as wellViresh Kumar
The policy->transition_delay_us field is used only by the schedutil governor currently, and this field describes how fast the driver wants the cpufreq governor to change CPUs frequency. It should rather be a common thing across all governors, as it doesn't have any schedutil dependency here. Create a new helper cpufreq_policy_transition_delay_us() to get the transition delay across all governors. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-22device property: Get rid of struct fwnode_handle type fieldSakari Ailus
Instead of relying on the struct fwnode_handle type field, define fwnode_operations structs for all separate types of fwnodes. To find out the type, compare to the ops field to relevant ops structs. This change has two benefits: 1. it avoids adding the type field to each and every instance of struct fwnode_handle, thus saving memory and 2. makes the ops field the single factor that defines both the types of the fwnode as well as defines the implementation of its operations, decreasing the possibility of bugs when developing code dealing with fwnode internals. Suggested-by: Rob Herring <robh@kernel.org> Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-21Merge tag 'trace-v4.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Three minor updates - Use the new GFP_RETRY_MAYFAIL to be more aggressive in allocating memory for the ring buffer without causing OOMs - Fix a memory leak in adding and removing instances - Add __rcu annotation to be able to debug RCU usage of function tracing a bit better" * tag 'trace-v4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: trace: fix the errors caused by incompatible type of RCU variables tracing: Fix kmemleak in instance_rmdir tracing/ring_buffer: Try harder to allocate
2017-07-21Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "A cputime fix and code comments/organization fix to the deadline scheduler" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix confusing comments about selection of top pi-waiter sched/cputime: Don't use smp_processor_id() in preemptible context
2017-07-21Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Two hw-enablement patches, two race fixes, three fixes for regressions of semantics, plus a number of tooling fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Add proper condition to run sched_task callbacks perf/core: Fix locking for children siblings group read perf/core: Fix scheduling regression of pinned groups perf/x86/intel: Fix debug_store reset field for freq events perf/x86/intel: Add Goldmont Plus CPU PMU support perf/x86/intel: Enable C-state residency events for Apollo Lake perf symbols: Accept zero as the kernel base address Revert "perf/core: Drop kernel samples even though :u is specified" perf annotate: Fix broken arrow at row 0 connecting jmp instruction to its target perf evsel: State in the default event name if attr.exclude_kernel is set perf evsel: Fix attr.exclude_kernel setting for default cycles:p
2017-07-21Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixlet from Ingo Molnar: "Remove an unnecessary priority adjustment in the rtmutex code" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rtmutex: Remove unnecessary priority adjustment
2017-07-21Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Ingo Molnar: "A resume_irq() fix, plus a number of static declaration fixes" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/digicolor: Drop unnecessary static irqchip/mips-cpu: Drop unnecessary static irqchip/gic/realview: Drop unnecessary static irqchip/mips-gic: Remove population of irq domain names genirq/PM: Properly pretend disabled state when force resuming interrupts
2017-07-21cgroup: update debug controller to print out thread mode informationWaiman Long
Update debug controller so that it prints out debug info about thread mode. 1) The relationship between proc_cset and threaded_csets are displayed. 2) The status of being a thread root or threaded cgroup is displayed. This patch is extracted from Waiman's larger patch. v2: - Removed [thread root] / [threaded] from debug.cgroup_css_links file as the same information is available from cgroup.type. Suggested by Waiman. - Threaded marking is moved to the previous patch. Patch-originally-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-21cgroup: implement cgroup v2 thread supportTejun Heo
This patch implements cgroup v2 thread support. The goal of the thread mode is supporting hierarchical accounting and control at thread granularity while staying inside the resource domain model which allows coordination across different resource controllers and handling of anonymous resource consumptions. A cgroup is always created as a domain and can be made threaded by writing to the "cgroup.type" file. When a cgroup becomes threaded, it becomes a member of a threaded subtree which is anchored at the closest ancestor which isn't threaded. The threads of the processes which are in a threaded subtree can be placed anywhere without being restricted by process granularity or no-internal-process constraint. Note that the threads aren't allowed to escape to a different threaded subtree. To be used inside a threaded subtree, a controller should explicitly support threaded mode and be able to handle internal competition in the way which is appropriate for the resource. The root of a threaded subtree, the nearest ancestor which isn't threaded, is called the threaded domain and serves as the resource domain for the whole subtree. This is the last cgroup where domain controllers are operational and where all the domain-level resource consumptions in the subtree are accounted. This allows threaded controllers to operate at thread granularity when requested while staying inside the scope of system-level resource distribution. As the root cgroup is exempt from the no-internal-process constraint, it can serve as both a threaded domain and a parent to normal cgroups, so, unlike non-root cgroups, the root cgroup can have both domain and threaded children. Internally, in a threaded subtree, each css_set has its ->dom_cset pointing to a matching css_set which belongs to the threaded domain. This ensures that thread root level cgroup_subsys_state for all threaded controllers are readily accessible for domain-level operations. This patch enables threaded mode for the pids and perf_events controllers. Neither has to worry about domain-level resource consumptions and it's enough to simply set the flag. For more details on the interface and behavior of the thread mode, please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added by this patch. v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create(). Spotted by Waiman. - Documentation updated as suggested by Waiman. - cgroup.type content slightly reformatted. - Mark the debug controller threaded. v4: - Updated to the general idea of marking specific cgroups domain/threaded as suggested by PeterZ. v3: - Dropped "join" and always make mixed children join the parent's threaded subtree. v2: - After discussions with Waiman, support for mixed thread mode is added. This should address the issue that Peter pointed out where any nesting should be avoided for thread subtrees while coexisting with other domain cgroups. - Enabling / disabling thread mode now piggy backs on the existing control mask update mechanism. - Bug fixes and cleanup. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21cgroup: implement CSS_TASK_ITER_THREADEDTejun Heo
cgroup v2 is in the process of growing thread granularity support. Once thread mode is enabled, the root cgroup of the subtree serves as the dom_cgrp to which the processes of the subtree conceptually belong and domain-level resource consumptions not tied to any specific task are charged. In the subtree, threads won't be subject to process granularity or no-internal-task constraint and can be distributed arbitrarily across the subtree. This patch implements a new task iterator flag CSS_TASK_ITER_THREADED, which, when used on a dom_cgrp, makes the iteration include the tasks on all the associated threaded css_sets. "cgroup.procs" read path is updated to use it so that reading the file on a proc_cgrp lists all processes. This will also be used by controller implementations which need to walk processes or tasks at the resource domain level. Task iteration is implemented nested in css_set iteration. If CSS_TASK_ITER_THREADED is specified, after walking tasks of each !threaded css_set, all the associated threaded css_sets are visited before moving onto the next !threaded css_set. v2: ->cur_pcset renamed to ->cur_dcset. Updated for the new enable-threaded-per-cgroup behavior. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-21cgroup: introduce cgroup->dom_cgrp and threaded css_set handlingTejun Heo
cgroup v2 is in the process of growing thread granularity support. A threaded subtree is composed of a thread root and threaded cgroups which are proper members of the subtree. The root cgroup of the subtree serves as the domain cgroup to which the processes (as opposed to threads / tasks) of the subtree conceptually belong and domain-level resource consumptions not tied to any specific task are charged. Inside the subtree, threads won't be subject to process granularity or no-internal-task constraint and can be distributed arbitrarily across the subtree. This patch introduces cgroup->dom_cgrp along with threaded css_set handling. * cgroup->dom_cgrp points to self for normal and thread roots. For proper thread subtree members, points to the dom_cgrp (the thread root). * css_set->dom_cset points to self if for normal and thread roots. If threaded, points to the css_set which belongs to the cgrp->dom_cgrp. The dom_cgrp serves as the resource domain and keeps the matching csses available. The dom_cset holds those csses and makes them easily accessible. * All threaded csets are linked on their dom_csets to enable iteration of all threaded tasks. * cgroup->nr_threaded_children keeps track of the number of threaded children. This patch adds the above but doesn't actually use them yet. The following patches will build on top. v4: ->nr_threaded_children added. v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset. Updated for the new enable-threaded-per-cgroup behavior. v2: Added cgroup_is_threaded() helper. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-21cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCSTejun Heo
css_task_iter currently always walks all tasks. With the scheduled cgroup v2 thread support, the iterator would need to handle multiple types of iteration. As a preparation, add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag is not specified, it walks all tasks as before. When asserted, the iterator only walks the group leaders. For now, the only user of the flag is cgroup v2 "cgroup.procs" file which no longer needs to skip non-leader tasks in cgroup_procs_next(). Note that cgroup v1 "cgroup.procs" can't use the group leader walk as v1 "cgroup.procs" doesn't mean "list all thread group leaders in the cgroup" but "list all thread group id's with any threads in the cgroup". While at it, update cgroup_procs_show() to use task_pid_vnr() instead of task_tgid_vnr(). As the iteration guarantees that the function only sees group leaders, this doesn't change the output and will allow sharing the function for thread iteration. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-21cgroup: reorganize cgroup.procs / task write pathTejun Heo
Currently, writes "cgroup.procs" and "cgroup.tasks" files are all handled by __cgroup_procs_write() on both v1 and v2. This patch reoragnizes the write path so that there are common helper functions that different write paths use. While this somewhat increases LOC, the different paths are no longer intertwined and each path has more flexibility to implement different behaviors which will be necessary for the planned v2 thread support. v3: - Restructured so that cgroup_procs_write_permission() takes @src_cgrp and @dst_cgrp. v2: - Rolled in Waiman's task reference count fix. - Updated on top of nsdelegate changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com>
2017-07-21perf/core: Fix locking for children siblings group readJiri Olsa
We're missing ctx lock when iterating children siblings within the perf_read path for group reading. Following race and crash can happen: User space doing read syscall on event group leader: T1: perf_read lock event->ctx->mutex perf_read_group lock leader->child_mutex __perf_read_group_add(child) list_for_each_entry(sub, &leader->sibling_list, group_entry) ----> sub might be invalid at this point, because it could get removed via perf_event_exit_task_context in T2 Child exiting and cleaning up its events: T2: perf_event_exit_task_context lock ctx->mutex list_for_each_entry_safe(child_event, next, &child_ctx->event_list,... perf_event_exit_event(child) lock ctx->lock perf_group_detach(child) unlock ctx->lock ----> child is removed from sibling_list without any sync with T1 path above ... free_event(child) Before the child is removed from the leader's child_list, (and thus is omitted from perf_read_group processing), we need to ensure that perf_read_group touches child's siblings under its ctx->lock. Peter further notes: | One additional note; this bug got exposed by commit: | | ba5213ae6b88 ("perf/core: Correct event creation with PERF_FORMAT_GROUP") | | which made it possible to actually trigger this code-path. Tested-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: ba5213ae6b88 ("perf/core: Correct event creation with PERF_FORMAT_GROUP") Link: http://lkml.kernel.org/r/20170720141455.2106-1-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2017-07-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) BPF verifier signed/unsigned value tracking fix, from Daniel Borkmann, Edward Cree, and Josef Bacik. 2) Fix memory allocation length when setting up calls to ->ndo_set_mac_address, from Cong Wang. 3) Add a new cxgb4 device ID, from Ganesh Goudar. 4) Fix FIB refcount handling, we have to set it's initial value before the configure callback (which can bump it). From David Ahern. 5) Fix double-free in qcom/emac driver, from Timur Tabi. 6) A bunch of gcc-7 string format overflow warning fixes from Arnd Bergmann. 7) Fix link level headroom tests in ip_do_fragment(), from Vasily Averin. 8) Fix chunk walking in SCTP when iterating over error and parameter headers. From Alexander Potapenko. 9) TCP BBR congestion control fixes from Neal Cardwell. 10) Fix SKB fragment handling in bcmgenet driver, from Doug Berger. 11) BPF_CGROUP_RUN_PROG_SOCK_OPS needs to check for null __sk, from Cong Wang. 12) xmit_recursion in ppp driver needs to be per-device not per-cpu, from Gao Feng. 13) Cannot release skb->dst in UDP if IP options processing needs it. From Paolo Abeni. 14) Some netdev ioctl ifr_name[] NULL termination fixes. From Alexander Levin and myself. 15) Revert some rtnetlink notification changes that are causing regressions, from David Ahern. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits) net: bonding: Fix transmit load balancing in balance-alb mode rds: Make sure updates to cp_send_gen can be observed net: ethernet: ti: cpsw: Push the request_irq function to the end of probe ipv4: initialize fib_trie prior to register_netdev_notifier call. rtnetlink: allocate more memory for dev_set_mac_address() net: dsa: b53: Add missing ARL entries for BCM53125 bpf: more tests for mixed signed and unsigned bounds checks bpf: add test for mixed signed and unsigned bounds checks bpf: fix up test cases with mixed signed/unsigned bounds bpf: allow to specify log level and reduce it for test_verifier bpf: fix mixed signed/unsigned derived min/max value bounds ipv6: avoid overflow of offset in ip6_find_1stfragopt net: tehuti: don't process data if it has not been copied from userspace Revert "rtnetlink: Do not generate notifications for CHANGEADDR event" net: dsa: mv88e6xxx: Enable CMODE config support for 6390X dt-binding: ptp: Add SoC compatibility strings for dte ptp clock NET: dwmac: Make dwmac reset unconditional net: Zero terminate ifr_name in dev_ifname(). wireless: wext: terminate ifr name coming from userspace netfilter: fix netfilter_net_init() return ...
2017-07-20bpf: fix mixed signed/unsigned derived min/max value boundsDaniel Borkmann
Edward reported that there's an issue in min/max value bounds tracking when signed and unsigned compares both provide hints on limits when having unknown variables. E.g. a program such as the following should have been rejected: 0: (7a) *(u64 *)(r10 -8) = 0 1: (bf) r2 = r10 2: (07) r2 += -8 3: (18) r1 = 0xffff8a94cda93400 5: (85) call bpf_map_lookup_elem#1 6: (15) if r0 == 0x0 goto pc+7 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp 7: (7a) *(u64 *)(r10 -16) = -8 8: (79) r1 = *(u64 *)(r10 -16) 9: (b7) r2 = -1 10: (2d) if r1 > r2 goto pc+3 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0 R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 11: (65) if r1 s> 0x1 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=1 R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 12: (0f) r0 += r1 13: (72) *(u8 *)(r0 +0) = 0 R0=map_value_adj(ks=8,vs=8,id=0),min_value=0,max_value=1 R1=inv,min_value=0,max_value=1 R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 14: (b7) r0 = 0 15: (95) exit What happens is that in the first part ... 8: (79) r1 = *(u64 *)(r10 -16) 9: (b7) r2 = -1 10: (2d) if r1 > r2 goto pc+3 ... r1 carries an unsigned value, and is compared as unsigned against a register carrying an immediate. Verifier deduces in reg_set_min_max() that since the compare is unsigned and operation is greater than (>), that in the fall-through/false case, r1's minimum bound must be 0 and maximum bound must be r2. Latter is larger than the bound and thus max value is reset back to being 'invalid' aka BPF_REGISTER_MAX_RANGE. Thus, r1 state is now 'R1=inv,min_value=0'. The subsequent test ... 11: (65) if r1 s> 0x1 goto pc+2 ... is a signed compare of r1 with immediate value 1. Here, verifier deduces in reg_set_min_max() that since the compare is signed this time and operation is greater than (>), that in the fall-through/false case, we can deduce that r1's maximum bound must be 1, meaning with prior test, we result in r1 having the following state: R1=inv,min_value=0,max_value=1. Given that the actual value this holds is -8, the bounds are wrongly deduced. When this is being added to r0 which holds the map_value(_adj) type, then subsequent store access in above case will go through check_mem_access() which invokes check_map_access_adj(), that will then probe whether the map memory is in bounds based on the min_value and max_value as well as access size since the actual unknown value is min_value <= x <= max_value; commit fce366a9dd0d ("bpf, verifier: fix alu ops against map_value{, _adj} register types") provides some more explanation on the semantics. It's worth to note in this context that in the current code, min_value and max_value tracking are used for two things, i) dynamic map value access via check_map_access_adj() and since commit 06c1c049721a ("bpf: allow helpers access to variable memory") ii) also enforced at check_helper_mem_access() when passing a memory address (pointer to packet, map value, stack) and length pair to a helper and the length in this case is an unknown value defining an access range through min_value/max_value in that case. The min_value/max_value tracking is /not/ used in the direct packet access case to track ranges. However, the issue also affects case ii), for example, the following crafted program based on the same principle must be rejected as well: 0: (b7) r2 = 0 1: (bf) r3 = r10 2: (07) r3 += -512 3: (7a) *(u64 *)(r10 -16) = -8 4: (79) r4 = *(u64 *)(r10 -16) 5: (b7) r6 = -1 6: (2d) if r4 > r6 goto pc+5 R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512 R4=inv,min_value=0 R6=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 7: (65) if r4 s> 0x1 goto pc+4 R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512 R4=inv,min_value=0,max_value=1 R6=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 8: (07) r4 += 1 9: (b7) r5 = 0 10: (6a) *(u16 *)(r10 -512) = 0 11: (85) call bpf_skb_load_bytes#26 12: (b7) r0 = 0 13: (95) exit Meaning, while we initialize the max_value stack slot that the verifier thinks we access in the [1,2] range, in reality we pass -7 as length which is interpreted as u32 in the helper. Thus, this issue is relevant also for the case of helper ranges. Resetting both bounds in check_reg_overflow() in case only one of them exceeds limits is also not enough as similar test can be created that uses values which are within range, thus also here learned min value in r1 is incorrect when mixed with later signed test to create a range: 0: (7a) *(u64 *)(r10 -8) = 0 1: (bf) r2 = r10 2: (07) r2 += -8 3: (18) r1 = 0xffff880ad081fa00 5: (85) call bpf_map_lookup_elem#1 6: (15) if r0 == 0x0 goto pc+7 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp 7: (7a) *(u64 *)(r10 -16) = -8 8: (79) r1 = *(u64 *)(r10 -16) 9: (b7) r2 = 2 10: (3d) if r2 >= r1 goto pc+3 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp 11: (65) if r1 s> 0x4 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp 12: (0f) r0 += r1 13: (72) *(u8 *)(r0 +0) = 0 R0=map_value_adj(ks=8,vs=8,id=0),min_value=3,max_value=4 R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp 14: (b7) r0 = 0 15: (95) exit This leaves us with two options for fixing this: i) to invalidate all prior learned information once we switch signed context, ii) to track min/max signed and unsigned boundaries separately as done in [0]. (Given latter introduces major changes throughout the whole verifier, it's rather net-next material, thus this patch follows option i), meaning we can derive bounds either from only signed tests or only unsigned tests.) There is still the case of adjust_reg_min_max_vals(), where we adjust bounds on ALU operations, meaning programs like the following where boundaries on the reg get mixed in context later on when bounds are merged on the dst reg must get rejected, too: 0: (7a) *(u64 *)(r10 -8) = 0 1: (bf) r2 = r10 2: (07) r2 += -8 3: (18) r1 = 0xffff89b2bf87ce00 5: (85) call bpf_map_lookup_elem#1 6: (15) if r0 == 0x0 goto pc+6 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp 7: (7a) *(u64 *)(r10 -16) = -8 8: (79) r1 = *(u64 *)(r10 -16) 9: (b7) r2 = 2 10: (3d) if r2 >= r1 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp 11: (b7) r7 = 1 12: (65) if r7 s> 0x0 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,max_value=0 R10=fp 13: (b7) r0 = 0 14: (95) exit from 12 to 15: R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,min_value=1 R10=fp 15: (0f) r7 += r1 16: (65) if r7 s> 0x4 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp 17: (0f) r0 += r7 18: (72) *(u8 *)(r0 +0) = 0 R0=map_value_adj(ks=8,vs=8,id=0),min_value=4,max_value=4 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp 19: (b7) r0 = 0 20: (95) exit Meaning, in adjust_reg_min_max_vals() we must also reset range values on the dst when src/dst registers have mixed signed/ unsigned derived min/max value bounds with one unbounded value as otherwise they can be added together deducing false boundaries. Once both boundaries are established from either ALU ops or compare operations w/o mixing signed/unsigned insns, then they can safely be added to other regs also having both boundaries established. Adding regs with one unbounded side to a map value where the bounded side has been learned w/o mixing ops is possible, but the resulting map value won't recover from that, meaning such op is considered invalid on the time of actual access. Invalid bounds are set on the dst reg in case i) src reg, or ii) in case dst reg already had them. The only way to recover would be to perform i) ALU ops but only 'add' is allowed on map value types or ii) comparisons, but these are disallowed on pointers in case they span a range. This is fine as only BPF_JEQ and BPF_JNE may be performed on PTR_TO_MAP_VALUE_OR_NULL registers which potentially turn them into PTR_TO_MAP_VALUE type depending on the branch, so only here min/max value cannot be invalidated for them. In terms of state pruning, value_from_signed is considered as well in states_equal() when dealing with adjusted map values. With regards to breaking existing programs, there is a small risk, but use-cases are rather quite narrow where this could occur and mixing compares probably unlikely. Joint work with Josef and Edward. [0] https://lists.iovisor.org/pipermail/iovisor-dev/2017-June/000822.html Fixes: 484611357c19 ("bpf: allow access into map value arrays") Reported-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-20Merge branch 'stable-4.13' of git://git.infradead.org/users/pcmoore/auditLinus Torvalds
Pull audit fix from Paul Moore: "A small audit fix, just a single line, to plug a memory leak in some audit error handling code" * 'stable-4.13' of git://git.infradead.org/users/pcmoore/audit: audit: fix memleak in auditd_send_unicast_skb.
2017-07-20smp/hotplug: Handle removal correctly in cpuhp_store_callbacks()Ethan Barnes
If cpuhp_store_callbacks() is called for CPUHP_AP_ONLINE_DYN or CPUHP_BP_PREPARE_DYN, which are the indicators for dynamically allocated states, then cpuhp_store_callbacks() allocates a new dynamic state. The first allocation in each range returns CPUHP_AP_ONLINE_DYN or CPUHP_BP_PREPARE_DYN. If cpuhp_remove_state() is invoked for one of these states, then there is no protection against the allocation mechanism. So the removal, which should clear the callbacks and the name, gets a new state assigned and clears that one. As a consequence the state which should be cleared stays initialized. A consecutive CPU hotplug operation dereferences the state callbacks and accesses either freed or reused memory, resulting in crashes. Add a protection against this by checking the name argument for NULL. If it's NULL it's a removal. If not, it's an allocation. [ tglx: Added a comment and massaged changelog ] Fixes: 5b7aa87e0482 ("cpu/hotplug: Implement setup/removal interface") Signed-off-by: Ethan Barnes <ethan.barnes@sandisk.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.or> Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu> Cc: Sebastian Siewior <bigeasy@linutronix.d> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/DM2PR04MB398242FC7776D603D9F99C894A60@DM2PR04MB398.namprd04.prod.outlook.com
2017-07-20trace: fix the errors caused by incompatible type of RCU variablesChunyan Zhang
The variables which are processed by RCU functions should be annotated as RCU, otherwise sparse will report the errors like below: "error: incompatible types in comparison expression (different address spaces)" Link: http://lkml.kernel.org/r/1496823171-7758-1-git-send-email-zhang.chunyan@linaro.org Signed-off-by: Chunyan Zhang <zhang.chunyan@linaro.org> [ Updated to not be 100% 80 column strict ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-07-20tracing: Fix kmemleak in instance_rmdirChunyu Hu
Hit the kmemleak when executing instance_rmdir, it forgot releasing mem of tracing_cpumask. With this fix, the warn does not appear any more. unreferenced object 0xffff93a8dfaa7c18 (size 8): comm "mkdir", pid 1436, jiffies 4294763622 (age 9134.308s) hex dump (first 8 bytes): ff ff ff ff ff ff ff ff ........ backtrace: [<ffffffff88b6567a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff8861ea41>] __kmalloc_node+0xf1/0x280 [<ffffffff88b505d3>] alloc_cpumask_var_node+0x23/0x30 [<ffffffff88b5060e>] alloc_cpumask_var+0xe/0x10 [<ffffffff88571ab0>] instance_mkdir+0x90/0x240 [<ffffffff886e5100>] tracefs_syscall_mkdir+0x40/0x70 [<ffffffff886565c9>] vfs_mkdir+0x109/0x1b0 [<ffffffff8865b1d0>] SyS_mkdir+0xd0/0x100 [<ffffffff88403857>] do_syscall_64+0x67/0x150 [<ffffffff88b710e7>] return_from_SYSCALL_64+0x0/0x6a [<ffffffffffffffff>] 0xffffffffffffffff Link: http://lkml.kernel.org/r/1500546969-12594-1-git-send-email-chuhu@redhat.com Cc: stable@vger.kernel.org Fixes: ccfe9e42e451 ("tracing: Make tracing_cpumask available for all instances") Signed-off-by: Chunyu Hu <chuhu@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-07-20prctl: Allow local CAP_SYS_ADMIN changing exe_fileKirill Tkhai
During checkpointing and restore of userspace tasks we bumped into the situation, that it's not possible to restore the tasks, which user namespace does not have uid 0 or gid 0 mapped. People create user namespace mappings like they want, and there is no a limitation on obligatory uid and gid "must be mapped". So, if there is no uid 0 or gid 0 in the mapping, it's impossible to restore mm->exe_file of the processes belonging to this user namespace. Also, there is no a workaround. It's impossible to create a temporary uid/gid mapping, because only one write to /proc/[pid]/uid_map and gid_map is allowed during a namespace lifetime. If there is an entry, then no more mapings can't be written. If there isn't an entry, we can't write there too, otherwise user task won't be able to do that in the future. The patch changes the check, and looks for CAP_SYS_ADMIN instead of zero uid and gid. This allows to restore a task independently of its user namespace mappings. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Serge Hallyn <serge@hallyn.com> CC: "Eric W. Biederman" <ebiederm@xmission.com> CC: Oleg Nesterov <oleg@redhat.com> CC: Michal Hocko <mhocko@suse.com> CC: Andrei Vagin <avagin@openvz.org> CC: Cyrill Gorcunov <gorcunov@openvz.org> CC: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com> CC: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-07-20userns,pidns: Verify the userns for new pid namespacesEric W. Biederman
It is pointless and confusing to allow a pid namespace hierarchy and the user namespace hierarchy to get out of sync. The owner of a child pid namespace should be the owner of the parent pid namespace or a descendant of the owner of the parent pid namespace. Otherwise it is possible to construct scenarios where a process has a capability over a parent pid namespace but does not have the capability over a child pid namespace. Which confusingly makes permission checks non-transitive. It requires use of setns into a pid namespace (but not into a user namespace) to create such a scenario. Add the function in_userns to help in making this determination. v2: Optimized in_userns by using level as suggested by: Kirill Tkhai <ktkhai@virtuozzo.com> Ref: 49f4d8b93ccf ("pidns: Capture the user namespace and filter ns_last_pid") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-07-20perf/core: Fix scheduling regression of pinned groupsAlexander Shishkin
Vince Weaver reported: > I was tracking down some regressions in my perf_event_test testsuite. > Some of the tests broke in the 4.11-rc1 timeframe. > > I've bisected one of them, this report is about > tests/overflow/simul_oneshot_group_overflow > This test creates an event group containing two sampling events, set > to overflow to a signal handler (which disables and then refreshes the > event). > > On a good kernel you get the following: > Event perf::instructions with period 1000000 > Event perf::instructions with period 2000000 > fd 3 overflows: 946 (perf::instructions/1000000) > fd 4 overflows: 473 (perf::instructions/2000000) > Ending counts: > Count 0: 946379875 > Count 1: 946365218 > > With the broken kernels you get: > Event perf::instructions with period 1000000 > Event perf::instructions with period 2000000 > fd 3 overflows: 938 (perf::instructions/1000000) > fd 4 overflows: 318 (perf::instructions/2000000) > Ending counts: > Count 0: 946373080 > Count 1: 653373058 The root cause of the bug is that the following commit: 487f05e18a ("perf/core: Optimize event rescheduling on active contexts") erronously assumed that event's 'pinned' setting determines whether the event belongs to a pinned group or not, but in fact, it's the group leader's pinned state that matters. This was discovered by Vince in the test case described above, where two instruction counters are grouped, the group leader is pinned, but the other event is not; in the regressed case the counters were off by 33% (the difference between events' periods), but should be the same within the error margin. Fix the problem by looking at the group leader's pinning. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: 487f05e18a ("perf/core: Optimize event rescheduling on active contexts") Link: http://lkml.kernel.org/r/87lgnmvw7h.fsf@ashishki-desk.ger.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-19Merge tag 'gcc-plugins-v4.13-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull structure randomization updates from Kees Cook: "Now that IPC and other changes have landed, enable manual markings for randstruct plugin, including the task_struct. This is the rest of what was staged in -next for the gcc-plugins, and comes in three patches, largest first: - mark "easy" structs with __randomize_layout - mark task_struct with an optional anonymous struct to isolate the __randomize_layout section - mark structs to opt _out_ of automated marking (which will come later) And, FWIW, this continues to pass allmodconfig (normal and patched to enable gcc-plugins) builds of x86_64, i386, arm64, arm, powerpc, and s390 for me" * tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: randstruct: opt-out externally exposed function pointer structs task_struct: Allow randomized layout randstruct: Mark various structs for randomization
2017-07-19workqueue: restore WQ_UNBOUND/max_active==1 to be orderedTejun Heo
The combination of WQ_UNBOUND and max_active == 1 used to imply ordered execution. After NUMA affinity 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound workqueues"), this is no longer true due to per-node worker pools. While the right way to create an ordered workqueue is alloc_ordered_workqueue(), the documentation has been misleading for a long time and people do use WQ_UNBOUND and max_active == 1 for ordered workqueues which can lead to subtle bugs which are very difficult to trigger. It's unlikely that we'd see noticeable performance impact by enforcing ordering on WQ_UNBOUND / max_active == 1 workqueues. Let's automatically set __WQ_ORDERED for those workqueues. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christoph Hellwig <hch@infradead.org> Reported-by: Alexei Potashnik <alexei@purestorage.com> Fixes: 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound workqueues") Cc: stable@vger.kernel.org # v3.10+
2017-07-19audit: fix memleak in auditd_send_unicast_skb.Shu Wang
Found this issue by kmemleak report, auditd_send_unicast_skb did not free skb if rcu_dereference(auditd_conn) returns null. unreferenced object 0xffff88082568ce00 (size 256): comm "auditd", pid 1119, jiffies 4294708499 backtrace: [<ffffffff8176166a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff8121820c>] kmem_cache_alloc_node+0xcc/0x210 [<ffffffff8161b99d>] __alloc_skb+0x5d/0x290 [<ffffffff8113c614>] audit_make_reply+0x54/0xd0 [<ffffffff8113dfa7>] audit_receive_msg+0x967/0xd70 ---------------- (gdb) list *audit_receive_msg+0x967 0xffffffff8113dff7 is in audit_receive_msg (kernel/audit.c:1133). 1132 skb = audit_make_reply(0, AUDIT_REPLACE, 0, 0, &pvnr, sizeof(pvnr)); --------------- [<ffffffff8113e402>] audit_receive+0x52/0xa0 [<ffffffff8166c561>] netlink_unicast+0x181/0x240 [<ffffffff8166c8e2>] netlink_sendmsg+0x2c2/0x3b0 [<ffffffff816112e8>] sock_sendmsg+0x38/0x50 [<ffffffff816117a2>] SYSC_sendto+0x102/0x190 [<ffffffff81612f4e>] SyS_sendto+0xe/0x10 [<ffffffff8176d337>] entry_SYSCALL_64_fastpath+0x1a/0xa5 [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Shu Wang <shuwang@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-07-19tracing/ring_buffer: Try harder to allocateJoel Fernandes
ftrace can fail to allocate per-CPU ring buffer on systems with a large number of CPUs coupled while large amounts of cache happening in the page cache. Currently the ring buffer allocation doesn't retry in the VM implementation even if direct-reclaim made some progress but still wasn't able to find a free page. On retrying I see that the allocations almost always succeed. The retry doesn't happen because __GFP_NORETRY is used in the tracer to prevent the case where we might OOM, however if we drop __GFP_NORETRY, we risk destabilizing the system if OOM killer is triggered. To prevent this situation, use the __GFP_RETRY_MAYFAIL flag introduced recently [1]. Tested the following still succeeds without destabilizing a system with 1GB memory. echo 300000 > /sys/kernel/debug/tracing/buffer_size_kb [1] https://marc.info/?l=linux-mm&m=149820805124906&w=2 Link: http://lkml.kernel.org/r/20170713021416.8897-1-joelaf@google.com Cc: Tim Murray <timmurray@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-07-18cgroup: create dfl_root files on subsys registrationTejun Heo
On subsystem registration, css_populate_dir() is not called on the new root css, so the interface files for the subsystem on cgrp_dfl_root aren't created on registration. This is a residue from the days when cgrp_dfl_root was used only as the parking spot for unused subsystems, which no longer is true as it's used as the root for cgroup2. This is often fine as later operations tend to create them as a part of mount (cgroup1) or subtree_control operations (cgroup2); however, it's not difficult to mount cgroup2 with the controller interface files missing as Waiman found out. Fix it by invoking css_populate_dir() on the root css on subsys registration. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org # v4.5+ Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-18x86/mm, kexec: Allow kexec to be used with SMETom Lendacky
Provide support so that kexec can be used to boot a kernel when SME is enabled. Support is needed to allocate pages for kexec without encryption. This is needed in order to be able to reboot in the kernel in the same manner as originally booted. Additionally, when shutting down all of the CPUs we need to be sure to flush the caches and then halt. This is needed when booting from a state where SME was not active into a state where SME is active (or vice-versa). Without these steps, it is possible for cache lines to exist for the same physical location but tagged both with and without the encryption bit. This can cause random memory corruption when caches are flushed depending on which cacheline is written last. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: <kexec@lists.infradead.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Dave Young <dyoung@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Toshimitsu Kani <toshi.kani@hpe.com> Cc: kasan-dev@googlegroups.com Cc: kvm@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-efi@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/b95ff075db3e7cd545313f2fb609a49619a09625.1500319216.git.thomas.lendacky@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-18x86/mm: Add support to access boot related data in the clearTom Lendacky
Boot data (such as EFI related data) is not encrypted when the system is booted because UEFI/BIOS does not run with SME active. In order to access this data properly it needs to be mapped decrypted. Update early_memremap() to provide an arch specific routine to modify the pagetable protection attributes before they are applied to the new mapping. This is used to remove the encryption mask for boot related data. Update memremap() to provide an arch specific routine to determine if RAM remapping is allowed. RAM remapping will cause an encrypted mapping to be generated. By preventing RAM remapping, ioremap_cache() will be used instead, which will provide a decrypted mapping of the boot related data. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Dave Young <dyoung@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Toshimitsu Kani <toshi.kani@hpe.com> Cc: kasan-dev@googlegroups.com Cc: kvm@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-efi@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/81fb6b4117a5df6b9f2eda342f81bbef4b23d2e5.1500319216.git.thomas.lendacky@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-18LSM: Remove security_task_create() hook.Tetsuo Handa
Since commit a79be238600d1a03 ("selinux: Use task_alloc hook rather than task_create hook") changed to use task_alloc hook, task_create hook is no longer used. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-07-17genirq/PM: Properly pretend disabled state when force resuming interruptsJuergen Gross
Interrupts with the IRQF_FORCE_RESUME flag set have also the IRQF_NO_SUSPEND flag set. They are not disabled in the suspend path, but must be forcefully resumed. That's used by XEN to keep IPIs enabled beyond the suspension of device irqs. Force resume works by pretending that the interrupt was disabled and then calling __irq_enable(). Incrementing the disabled depth counter was enough to do that, but with the recent changes which use state flags to avoid unnecessary hardware access, this is not longer sufficient. If the state flags are not set, then the hardware callbacks are not invoked and the interrupt line stays disabled in "hardware". Set the disabled and masked state when pretending that an interrupt got disabled by suspend. Fixes: bf22ff45bed6 ("genirq: Avoid unnecessary low level irq function calls") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: xen-devel@lists.xenproject.org Cc: boris.ostrovsky@oracle.com Link: http://lkml.kernel.org/r/20170717174703.4603-2-jgross@suse.com