summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2018-03-10tracing: Add usecs modifier for hist trigger timestampsTom Zanussi
Appending .usecs onto a common_timestamp field will cause the timestamp value to be in microseconds instead of the default nanoseconds. A typical latency histogram using usecs would look like this: # echo 'hist:keys=pid,prio:ts0=common_timestamp.usecs ... # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0 ... This also adds an external trace_clock_in_ns() to trace.c for the timestamp conversion. Link: http://lkml.kernel.org/r/4e813705a170b3e13e97dc3135047362fb1a39f3.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add hist_data member to hist_fieldTom Zanussi
Allow hist_data access via hist_field. Some users of hist_fields require or will require more access to the associated hist_data. Link: http://lkml.kernel.org/r/d04cd0768f5228ebb4ac0ba4a847bc4d14d4826f.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add per-element variable support to tracing_mapTom Zanussi
In order to allow information to be passed between trace events, add support for per-element variables to tracing_map. This provides a means for histograms to associate a value or values with an entry when it's saved or updated, and retrieved by a subsequent event occurrences. Variables can be set using tracing_map_set_var() and read using tracing_map_read_var(). tracing_map_var_set() returns true or false depending on whether or not the variable has been set or not, which is important for event-matching applications. tracing_map_read_var_once() reads the variable and resets it to the 'unset' state, implementing read-once variables, which are also important for event-matching uses. Link: http://lkml.kernel.org/r/7fa001108252556f0c6dd9d63145eabfe3370d1a.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add hist trigger timestamp supportTom Zanussi
Add support for a timestamp event field. This is actually a 'pseudo-' event field in that it behaves like it's part of the event record, but is really part of the corresponding ring buffer event. To make use of the timestamp field, users can specify "common_timestamp" as a field name for any histogram. Note that this doesn't make much sense on its own either as either a key or value, but needs to be supported even so, since follow-on patches will add support for making use of this field in time deltas. The common_timestamp 'field' is not a bona fide event field - so you won't find it in the event description - but rather it's a synthetic field that can be used like a real field. Note that the use of this field requires the ring buffer be put into 'absolute timestamp' mode, which saves the complete timestamp for each event rather than an offset. This mode will be enabled if and only if a histogram makes use of the "common_timestamp" field. Link: http://lkml.kernel.org/r/97afbd646ed146e26271f3458b4b33e16d7817c2.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Baohong Liu <baohong.liu@intel.com> [kasan use-after-free fix] Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Break out hist trigger assignment parsingTom Zanussi
This will make it easier to add variables, and makes the parsing code cleaner regardless. Link: http://lkml.kernel.org/r/e574b3291bbe15e35a4dfc87e5395aa715701c98.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Rajvi Jingar <rajvi.jingar@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add ring buffer event param to hist field functionsTom Zanussi
Some events such as timestamps require access to a ring_buffer_event struct; add a param so that hist field functions can access that. Link: http://lkml.kernel.org/r/2ff4af18e72b6002eb86b26b2a7f39cef7d1dfe4.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Give event triggers access to ring_buffer_eventTom Zanussi
The ring_buffer event can provide a timestamp that may be useful to various triggers - pass it into the handlers for that purpose. Link: http://lkml.kernel.org/r/6de592683b59fa70ffa5d43d0109896623fc1367.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add timestamp_mode trace fileTom Zanussi
Add a new option flag indicating whether or not the ring buffer is in 'absolute timestamp' mode. Currently this is only set/unset by hist triggers that make use of a common_timestamp. As such, there's no reason to make this writeable for users - its purpose is only to allow users to determine unequivocally whether or not the ring buffer is in that mode (although absolute timestamps can coexist with the normal delta timestamps, when the ring buffer is in absolute mode, timestamps written while absolute mode is in effect take up more space in the buffer, and are not as efficient). Link: http://lkml.kernel.org/r/e8aa7b1cde1cf15014e66545d06ac6ef2ebba456.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10ring-buffer: Redefine the unimplemented RINGBUF_TYPE_TIME_STAMPTom Zanussi
RINGBUF_TYPE_TIME_STAMP is defined but not used, and from what I can gather was reserved for something like an absolute timestamp feature for the ring buffer, if not a complete replacement of the current time_delta scheme. This code redefines RINGBUF_TYPE_TIME_STAMP to implement absolute time stamps. Another way to look at it is that it essentially forces extended time_deltas for all events. The motivation for doing this is to enable time_deltas that aren't dependent on previous events in the ring buffer, making it feasible to use the ring_buffer_event timetamps in a more random-access way, for purposes other than serial event printing. To set/reset this mode, use tracing_set_timestamp_abs() from the previous interface patch. Link: http://lkml.kernel.org/r/477b362dba1ce7fab9889a1a8e885a62c472f041.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10ring-buffer: Add interface for setting absolute time stampsTom Zanussi
Define a new function, tracing_set_time_stamp_abs(), which can be used to enable or disable the use of absolute timestamps rather than time deltas for a trace array. Only the interface is added here; a subsequent patch will add the underlying implementation. Link: http://lkml.kernel.org/r/ce96119de44c7fe0ee44786d15254e9b493040d3.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Baohong Liu <baohong.liu@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Remove code which merges duplicatesVedang Patel
We now have the logic to detect and remove duplicates in the tracing_map hash table. The code which merges duplicates in the histogram is redundant now. So, modify this code just to detect duplicates. The duplication detection code is still kept to ensure that any rare race condition which might cause duplicates does not go unnoticed. Link: http://lkml.kernel.org/r/55215cf59e2674391bdaf772fdafc4c393352b03.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add support to detect and avoid duplicatesVedang Patel
A duplicate in the tracing_map hash table is when 2 different entries have the same key and, as a result, the key_hash. This is possible due to a race condition in the algorithm. This race condition is inherent to the algorithm and not a bug. This was fine because, until now, we were only interested in the sum of all the values related to a particular key (the duplicates are dealt with in tracing_map_sort_entries()). But, with the inclusion of variables[1], we are interested in individual values. So, it will not be clear what value to choose when there are duplicates. So, the duplicates need to be removed. The duplicates can occur in the code in the following scenarios: - A thread is in the process of adding a new element. It has successfully executed cmpxchg() and inserted the key. But, it is still not done acquiring the trace_map_elt struct, populating it and storing the pointer to the struct in the value field of tracing_map hash table. If another thread comes in at this time and wants to add an element with the same key, it will not see the current element and add a new one. - There are multiple threads trying to execute cmpxchg at the same time, one of the threads will succeed and the others will fail. The ones which fail will go ahead increment 'idx' and add a new element there creating a duplicate. This patch detects and avoids the first condition by asking the thread which detects the duplicate to loop one more time. There is also a possibility of infinite loop if the thread which is trying to insert goes to sleep indefinitely and the one which is trying to insert a new element detects a duplicate. Which is why, the thread loops for map_size iterations before returning NULL. The second scenario is avoided by preventing the threads which failed cmpxchg() from incrementing idx. This way, they will loop around and check if the thread which succeeded in executing cmpxchg() had the same key. [1] http://lkml.kernel.org/r/cover.1498510759.git.tom.zanussi@linux.intel.com Link: http://lkml.kernel.org/r/e178e89ec399240331d383bd5913d649713110f4.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10Merge branch 'linus' into locking/core, to pick up fixes and dependenciesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-10timekeeping/ntp: Determine the multiplier directly from NTP tick lengthMiroslav Lichvar
When the length of the NTP tick changes significantly, e.g. when an NTP/PTP application is correcting the initial offset of the clock, a large value may accumulate in the NTP error before the multiplier converges to the correct value. It may then take a very long time (hours or even days) before the error is corrected. This causes the clock to have an unstable frequency offset, which has a negative impact on the stability of synchronization with precise time sources (e.g. NTP/PTP using hardware timestamping or the PTP KVM clock). Use division to determine the correct multiplier directly from the NTP tick length and replace the iterative approach. This removes the last major source of the NTP error. The only remaining source is now limited resolution of the multiplier, which is corrected by adding 1 to the multiplier when the system clock is behind the NTP time. Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1520620971-9567-3-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-10timekeeping/ntp: Don't align NTP frequency adjustments to ticksMiroslav Lichvar
When the timekeeping multiplier is changed, the NTP error is updated to correct the clock for the delay between the tick and the update of the clock. This error is corrected in later updates and the clock appears as if the frequency was changed exactly on the tick. Remove this correction to keep the point where the frequency is effectively changed at the time of the update. This removes a major source of the NTP error. Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1520620971-9567-2-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09bug: use %pB in BUG and stack protector failureKees Cook
The BUG and stack protector reports were still using a raw %p. This changes it to %pB for more meaningful output. Link: http://lkml.kernel.org/r/20180301225704.GA34198@beast Fixes: ad67b74d2469 ("printk: hash addresses printed with %p") Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Richard Weinberger <richard.weinberger@gmail.com>, Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-09arm64: signal: Ensure si_code is valid for all fault signalsDave Martin
Currently, as reported by Eric, an invalid si_code value 0 is passed in many signals delivered to userspace in response to faults and other kernel errors. Typically 0 is passed when the fault is insufficiently diagnosable or when there does not appear to be any sensible alternative value to choose. This appears to violate POSIX, and is intuitively wrong for at least two reasons arising from the fact that 0 == SI_USER: 1) si_code is a union selector, and SI_USER (and si_code <= 0 in general) implies the existence of a different set of fields (siginfo._kill) from that which exists for a fault signal (siginfo._sigfault). However, the code raising the signal typically writes only the _sigfault fields, and the _kill fields make no sense in this case. Thus when userspace sees si_code == 0 (SI_USER) it may legitimately inspect fields in the inactive union member _kill and obtain garbage as a result. There appears to be software in the wild relying on this, albeit generally only for printing diagnostic messages. 2) Software that wants to be robust against spurious signals may discard signals where si_code == SI_USER (or <= 0), or may filter such signals based on the si_uid and si_pid fields of siginfo._sigkill. In the case of fault signals, this means that important (and usually fatal) error conditions may be silently ignored. In practice, many of the faults for which arm64 passes si_code == 0 are undiagnosable conditions such as exceptions with syndrome values in ESR_ELx to which the architecture does not yet assign any meaning, or conditions indicative of a bug or error in the kernel or system and thus that are unrecoverable and should never occur in normal operation. The approach taken in this patch is to translate all such undiagnosable or "impossible" synchronous fault conditions to SIGKILL, since these are at least probably localisable to a single process. Some of these conditions should really result in a kernel panic, but due to the lack of diagnostic information it is difficult to be certain: this patch does not add any calls to panic(), but this could change later if justified. Although si_code will not reach userspace in the case of SIGKILL, it is still desirable to pass a nonzero value so that the common siginfo handling code can detect incorrect use of si_code == 0 without false positives. In this case the si_code dependent siginfo fields will not be correctly initialised, but since they are not passed to userspace I deem this not to matter. A few faults can reasonably occur in realistic userspace scenarios, and _should_ raise a regular, handleable (but perhaps not ignorable/blockable) signal: for these, this patch attempts to choose a suitable standard si_code value for the raised signal in each case instead of 0. arm64 was the only arch to define a BUS_FIXME code, so after this patch nobody defines it. This patch therefore also removes the relevant code from siginfo_layout(). Cc: James Morse <james.morse@arm.com> Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-09softirq: Consolidate common code in tasklet_[hi]_action()Ingo Molnar
tasklet_action() + tasklet_hi_action() are almost identical. Move the common code from both function into __tasklet_action_common() and let both functions invoke it with different arguments. [ bigeasy: Splitted out from RT's "tasklet: Prevent tasklets from going into infinite spin in RT" and added commit message] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Julia Cartwright <juliac@eso.teric.us> Link: https://lkml.kernel.org/r/20180227164808.10093-3-bigeasy@linutronix.de
2018-03-09softirq: Consolidate common code in __tasklet_[hi]_schedule()Ingo Molnar
__tasklet_schedule() and __tasklet_hi_schedule() are almost identical. Move the common code from both function into __tasklet_schedule_common() and let both functions invoke it with different arguments. [ bigeasy: Splitted out from RT's "tasklet: Prevent tasklets from going into infinite spin in RT" and added commit message. Use this_cpu_ptr(headp) in __tasklet_schedule_common() as suggested by Julia Cartwright ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Julia Cartwright <juliac@eso.teric.us> Link: https://lkml.kernel.org/r/20180227164808.10093-2-bigeasy@linutronix.de
2018-03-09genirq/irq_sim: Return the base of the irq range from irq_sim_init()Bartosz Golaszewski
Returning the base of the allocated interrupt range from irq_sim_init() and devm_irq_sim_init() allows users to handle the logic of associating irq numbers with any other driver-specific resources without having to use irq_sim_irqnum(). Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/20180304121018.640-4-brgl@bgdev.pl
2018-03-09genirq/irq_sim: Check the return value of irq_sim_init() for error codesBartosz Golaszewski
As discussed with Marc Zyngier: irq_sim_init() and its devres variant should return the base of the allocated interrupt range on success rather than 0. Make devm_irq_sim_init() check for an error code. This is a preparatory change for modifying irq_sim_init() itself. Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/20180304121018.640-3-brgl@bgdev.pl
2018-03-09genirq/irq_sim: Explicitly include slab.hBartosz Golaszewski
kfree() is used in the irq_sim code but slab.h is pulled in indirectly via irq.h. Include it explicitly. Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/20180304121018.640-2-brgl@bgdev.pl
2018-03-09rtmutex: Make rt_mutex_futex_unlock() safe for irq-off callsitesBoqun Feng
When running rcutorture with TREE03 config, CONFIG_PROVE_LOCKING=y, and kernel cmdline argument "rcutorture.gp_exp=1", lockdep reports a HARDIRQ-safe->HARDIRQ-unsafe deadlock: ================================ WARNING: inconsistent lock state 4.16.0-rc4+ #1 Not tainted -------------------------------- inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. takes: __schedule+0xbe/0xaf0 {IN-HARDIRQ-W} state was registered at: _raw_spin_lock+0x2a/0x40 scheduler_tick+0x47/0xf0 ... other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&rq->lock); <Interrupt> lock(&rq->lock); *** DEADLOCK *** 1 lock held by rcu_torture_rea/724: rcu_torture_read_lock+0x0/0x70 stack backtrace: CPU: 2 PID: 724 Comm: rcu_torture_rea Not tainted 4.16.0-rc4+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014 Call Trace: lock_acquire+0x90/0x200 ? __schedule+0xbe/0xaf0 _raw_spin_lock+0x2a/0x40 ? __schedule+0xbe/0xaf0 __schedule+0xbe/0xaf0 preempt_schedule_irq+0x2f/0x60 retint_kernel+0x1b/0x2d RIP: 0010:rcu_read_unlock_special+0x0/0x680 ? rcu_torture_read_unlock+0x60/0x60 __rcu_read_unlock+0x64/0x70 rcu_torture_read_unlock+0x17/0x60 rcu_torture_reader+0x275/0x450 ? rcutorture_booster_init+0x110/0x110 ? rcu_torture_stall+0x230/0x230 ? kthread+0x10e/0x130 kthread+0x10e/0x130 ? kthread_create_worker_on_cpu+0x70/0x70 ? call_usermodehelper_exec_async+0x11a/0x150 ret_from_fork+0x3a/0x50 This happens with the following even sequence: preempt_schedule_irq(); local_irq_enable(); __schedule(): local_irq_disable(); // irq off ... rcu_note_context_switch(): rcu_note_preempt_context_switch(): rcu_read_unlock_special(): local_irq_save(flags); ... raw_spin_unlock_irqrestore(...,flags); // irq remains off rt_mutex_futex_unlock(): raw_spin_lock_irq(); ... raw_spin_unlock_irq(); // accidentally set irq on <return to __schedule()> rq_lock(): raw_spin_lock(); // acquiring rq->lock with irq on which means rq->lock becomes a HARDIRQ-unsafe lock, which can cause deadlocks in scheduler code. This problem was introduced by commit 02a7c234e540 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints"). That brought the user of rt_mutex_futex_unlock() with irq off. To fix this, replace the *lock_irq() in rt_mutex_futex_unlock() with *lock_irq{save,restore}() to make it safe to call rt_mutex_futex_unlock() with irq off. Fixes: 02a7c234e540 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints") Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Link: https://lkml.kernel.org/r/20180309065630.8283-1-boqun.feng@gmail.com
2018-03-09bpf: comment why dots in filenames under BPF virtual FS are not allowedQuentin Monnet
When pinning a file under the BPF virtual file system (traditionally /sys/fs/bpf), using a dot in the name of the location to pin at is not allowed. For example, trying to pin at "/sys/fs/bpf/foo.bar" will be rejected with -EPERM. This check was introduced at the same time as the BPF file system itself, with commit b2197755b263 ("bpf: add support for persistent maps/progs"). At this time, it was checked in a function called "bpf_dname_reserved()", which made clear that using a dot was reserved for future extensions. This function disappeared and the check was moved elsewhere with commit 0c93b7d85d40 ("bpf: reject invalid names right in ->lookup()"), and the meaning of the dot ban was lost. The present commit simply adds a comment in the source to explain to the reader that the usage of dots is reserved for future usage. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-09perf/core: Fix ctx_event_type in ctx_resched()Song Liu
In ctx_resched(), EVENT_FLEXIBLE should be sched_out when EVENT_PINNED is added. However, ctx_resched() calculates ctx_event_type before checking this condition. As a result, pinned events will NOT get higher priority than flexible events. The following shows this issue on an Intel CPU (where ref-cycles can only use one hardware counter). 1. First start: perf stat -C 0 -e ref-cycles -I 1000 2. Then, in the second console, run: perf stat -C 0 -e ref-cycles:D -I 1000 The second perf uses pinned events, which is expected to have higher priority. However, because it failed in ctx_resched(). It is never run. This patch fixes this by calculating ctx_event_type after re-evaluating event_type. Reported-by: Ephraim Park <ephiepark@fb.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <jolsa@redhat.com> Cc: <kernel-team@fb.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 487f05e18aa4 ("perf/core: Optimize event rescheduling on active contexts") Link: http://lkml.kernel.org/r/20180306055504.3283731-1-songliubraving@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/completions: Use bool in try_wait_for_completion()gaurav jindal
Since the return type of the function is bool, the internal 'ret' variable should be bool too. Signed-off-by: Gaurav Jindal<gauravjindal1104@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180221125407.GA14292@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/fair: Update blocked load when newly idleVincent Guittot
When NEWLY_IDLE load balance is not triggered, we might need to update the blocked load anyway. We can kick an ilb so an idle CPU will take care of updating blocked load or we can try to update them locally before entering idle. In the latter case, we reuse part of the nohz_idle_balance. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: brendan.jackman@arm.com Cc: dietmar.eggemann@arm.com Cc: morten.rasmussen@foss.arm.com Cc: valentin.schneider@arm.com Link: http://lkml.kernel.org/r/1518622006-16089-4-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/fair: Move idle_balance()Peter Zijlstra
We're going to want to call nohz_idle_balance() or parts thereof from idle_balance(). Since we already have a forward declaration of idle_balance() move it down such that it's below nohz_idle_balance() avoiding the need for a forward declaration for that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/nohz: Merge CONFIG_NO_HZ_COMMON blocksPeter Zijlstra
Now that we have two back-to-back NO_HZ_COMMON blocks, merge them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/fair: Move rebalance_domains()Peter Zijlstra
This pure code movement results in two #ifdef CONFIG_NO_HZ_COMMON sections landing next to each other. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/nohz: Optimize nohz_idle_balance()Peter Zijlstra
Avoid calling update_blocked_averages() when it does not in fact have any by re-using/extending update_nohz_stats(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/fair: Reduce the periodic update durationVincent Guittot
Instead of using the cfs_rq_is_decayed() which monitors all *_avg and *_sum, we create a cfs_rq_has_blocked() which only takes care of util_avg and load_avg. We are only interested by these 2 values which are decaying faster than the *_sum so we can stop the periodic update earlier. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: brendan.jackman@arm.com Cc: dietmar.eggemann@arm.com Cc: morten.rasmussen@foss.arm.com Cc: valentin.schneider@arm.com Link: http://lkml.kernel.org/r/1518517879-2280-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/nohz: Stop NOHZ stats when decayedVincent Guittot
Stopped the periodic update of blocked load when all idle CPUs have fully decayed. We introduce a new nohz.has_blocked that reflect if some idle CPUs has blocked load that have to be periodiccally updated. nohz.has_blocked is set everytime that a Idle CPU can have blocked load and it is then clear when no more blocked load has been detected during an update. We don't need atomic operation but only to make cure of the right ordering when updating nohz.idle_cpus_mask and nohz.has_blocked. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: brendan.jackman@arm.com Cc: dietmar.eggemann@arm.com Cc: morten.rasmussen@foss.arm.com Cc: valentin.schneider@arm.com Link: http://lkml.kernel.org/r/1518517879-2280-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/cpufreq: Provide migration hintPeter Zijlstra
It was suggested that a migration hint might be usefull for the CPU-freq governors. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/nohz: Clean up nohz enter/exitPeter Zijlstra
The primary observation is that nohz enter/exit is always from the current CPU, therefore NOHZ_TICK_STOPPED does not in fact need to be an atomic. Secondary is that we appear to have 2 nearly identical hooks in the nohz enter code, set_cpu_sd_state_idle() and nohz_balance_enter_idle(). Fold the whole set_cpu_sd_state thing into nohz_balance_{enter,exit}_idle. Removes an atomic op from both enter and exit paths. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/fair: Update blocked load from NEWIDLEPeter Zijlstra
Since we already iterate CPUs looking for work on NEWIDLE, use this iteration to age the blocked load. If the domain for which this is done completely spand the idle set, we can push the ILB based aging forward. Suggested-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/fair: Add NOHZ stats balancingPeter Zijlstra
Teach the idle balancer about the need to update statistics which have a different periodicity from regular balancing. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/fair: Restructure nohz_balance_kick()Peter Zijlstra
The current: if (nohz_kick_needed()) nohz_balancer_kick() is pointless complexity, fold them into a single call and avoid the various conditions at the call site. When we introduce multiple different needs to kick the ilb, the above construct also becomes a problem. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/fair: Add NOHZ_STATS_KICKPeter Zijlstra
Split the NOHZ idle balancer into doing two separate actions: - update blocked load statistic - actually load-balance Since the latter requires the former, ensure this happens. For now always tag both bits at the same time. Prepares for a future where we can toggle only the STATS bit. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/core: Convert nohz_flags to atomic_tPeter Zijlstra
Using atomic_t allows us to use the more flexible bitops provided there. Also its smaller. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09cpufreq/schedutil: Rewrite CPUFREQ_RT supportPeter Zijlstra
Instead of trying to duplicate scheduler state to track if an RT task is running, directly use the scheduler runqueue state for it. This vastly simplifies things and fixes a number of bugs related to sugov and the scheduler getting out of sync wrt this state. As a consequence we not also update the remove cfs/dl state when iterating the shared mask. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09cpufreq/schedutil: Remove unused CPUFREQ_DLPeter Zijlstra
Bitrot... Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09sched/fair: Add ';' after label attributesNorbert Manthey
Due to using GCC defines for configuration, some labels might be unused in certain configurations. While adding a __maybe_unused to the label is fine in general, the line has to be terminated with ';'. This is also reflected in the GCC documentation, but GCC parsed the previous variant without an error message. This has been spotted while compiling with goto-cc, the compiler for the CPROVER tool suite. Signed-off-by: Norbert Manthey <nmanthey@amazon.de> Signed-off-by: Michael Tautschnig <tautschn@amazon.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1519717660-16157-1-git-send-email-nmanthey@amazon.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09Merge branch 'linus' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-08audit: link denied should not directly generate PATH recordRichard Guy Briggs
Audit link denied events generate duplicate PATH records which disagree in different ways from symlink and hardlink denials. audit_log_link_denied() should not directly generate PATH records. See: https://github.com/linux-audit/audit-kernel/issues/21 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-03-08audit: make ANOM_LINK obey audit_enabled and audit_dummy_contextRichard Guy Briggs
Audit link denied events emit disjointed records when audit is disabled. No records should be emitted when audit is disabled. See: https://github.com/linux-audit/audit-kernel/issues/21 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-03-08module: propagate error in modules_open()Leon Yu
otherwise kernel can oops later in seq_release() due to dereferencing null file->private_data which is only set if seq_open() succeeds. BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 IP: seq_release+0xc/0x30 Call Trace: close_pdeo+0x37/0xd0 proc_reg_release+0x5d/0x60 __fput+0x9d/0x1d0 ____fput+0x9/0x10 task_work_run+0x75/0x90 do_exit+0x252/0xa00 do_group_exit+0x36/0xb0 SyS_exit_group+0xf/0x10 Fixes: 516fb7f2e73d ("/proc/module: use the same logic as /proc/kallsyms for address exposure") Cc: Jessica Yu <jeyu@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org # 4.15+ Signed-off-by: Leon Yu <chianglungyu@gmail.com> Signed-off-by: Jessica Yu <jeyu@kernel.org>
2018-03-08Merge tag 'mips_fixes_4.16_4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips Pull MIPS fixes from James Hogan: "A miscellaneous pile of MIPS fixes for 4.16: - move put_compat_sigset() to evade hardened usercopy warnings (4.16) - select ARCH_HAVE_PC_{SERIO,PARPORT} for Loongson64 platforms (4.16) - fix kzalloc() failure handling in ath25 (3.19) and Octeon (4.0) - fix disabling of IPIs during BMIPS suspend (3.19)" * tag 'mips_fixes_4.16_4' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips: MIPS: BMIPS: Do not mask IPIs during suspend MIPS: Loongson64: Select ARCH_MIGHT_HAVE_PC_SERIO MIPS: Loongson64: Select ARCH_MIGHT_HAVE_PC_PARPORT signals: Move put_compat_sigset to compat.h to silence hardened usercopy MIPS: OCTEON: irq: Check for null return on kzalloc allocation MIPS: ath25: Check for kzalloc allocation failure
2018-03-08panic: Add closing panic marker parenthesisBorislav Petkov
Otherwise it looks unbalanced. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Link: https://lkml.kernel.org/r/20180306094920.16917-2-bp@alien8.de
2018-03-08bpf: add support to read sample address in bpf programTeng Qin
This commit adds new field "addr" to bpf_perf_event_data which could be read and used by bpf programs attached to perf events. The value of the field is copied from bpf_perf_event_data_kern.addr and contains the address value recorded by specifying sample_type with PERF_SAMPLE_ADDR when calling perf_event_open. Signed-off-by: Teng Qin <qinteng@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>