Age | Commit message (Collapse) | Author |
|
The index is off-by-one when fp->aux->stack_depth
has already been rounded up to 32. In particular,
if stack_depth is 512, the index will be 16.
The fix is to round_up and then takes -1 instead of round_down.
[ 22.318680] ==================================================================
[ 22.319745] BUG: KASAN: global-out-of-bounds in bpf_prog_select_runtime+0x48a/0x670
[ 22.320737] Read of size 8 at addr ffffffff82aadae0 by task sockex3/1946
[ 22.321646]
[ 22.321858] CPU: 1 PID: 1946 Comm: sockex3 Tainted: G W 4.12.0-rc6-01680-g2ee87db3a287 #22
[ 22.323061] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
[ 22.324260] Call Trace:
[ 22.324612] dump_stack+0x67/0x99
[ 22.325081] print_address_description+0x1e8/0x290
[ 22.325734] ? bpf_prog_select_runtime+0x48a/0x670
[ 22.326360] kasan_report+0x265/0x350
[ 22.326860] __asan_report_load8_noabort+0x19/0x20
[ 22.327484] bpf_prog_select_runtime+0x48a/0x670
[ 22.328109] bpf_prog_load+0x626/0xd40
[ 22.328637] ? __bpf_prog_charge+0xc0/0xc0
[ 22.329222] ? check_nnp_nosuid.isra.61+0x100/0x100
[ 22.329890] ? __might_fault+0xf6/0x1b0
[ 22.330446] ? lock_acquire+0x360/0x360
[ 22.331013] SyS_bpf+0x67c/0x24d0
[ 22.331491] ? trace_hardirqs_on+0xd/0x10
[ 22.332049] ? __getnstimeofday64+0xaf/0x1c0
[ 22.332635] ? bpf_prog_get+0x20/0x20
[ 22.333135] ? __audit_syscall_entry+0x300/0x600
[ 22.333770] ? syscall_trace_enter+0x540/0xdd0
[ 22.334339] ? exit_to_usermode_loop+0xe0/0xe0
[ 22.334950] ? do_syscall_64+0x48/0x410
[ 22.335446] ? bpf_prog_get+0x20/0x20
[ 22.335954] do_syscall_64+0x181/0x410
[ 22.336454] entry_SYSCALL64_slow_path+0x25/0x25
[ 22.337121] RIP: 0033:0x7f263fe81f19
[ 22.337618] RSP: 002b:00007ffd9a3440c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
[ 22.338619] RAX: ffffffffffffffda RBX: 0000000000aac5fb RCX: 00007f263fe81f19
[ 22.339600] RDX: 0000000000000030 RSI: 00007ffd9a3440d0 RDI: 0000000000000005
[ 22.340470] RBP: 0000000000a9a1e0 R08: 0000000000a9a1e0 R09: 0000009d00000001
[ 22.341430] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000010000
[ 22.342411] R13: 0000000000a9a023 R14: 0000000000000001 R15: 0000000000000003
[ 22.343369]
[ 22.343593] The buggy address belongs to the variable:
[ 22.344241] interpreters+0x80/0x980
[ 22.344708]
[ 22.344908] Memory state around the buggy address:
[ 22.345556] ffffffff82aad980: 00 00 00 04 fa fa fa fa 04 fa fa fa fa fa fa fa
[ 22.346449] ffffffff82aada00: 00 00 00 00 00 fa fa fa fa fa fa fa 00 00 00 00
[ 22.347361] >ffffffff82aada80: 00 00 00 00 00 00 00 00 00 00 00 00 fa fa fa fa
[ 22.348301] ^
[ 22.349142] ffffffff82aadb00: 00 01 fa fa fa fa fa fa 00 00 00 00 00 00 00 00
[ 22.350058] ffffffff82aadb80: 00 00 07 fa fa fa fa fa 00 00 05 fa fa fa fa fa
[ 22.350984] ==================================================================
Fixes: b870aa901f4b ("bpf: use different interpreter depending on required stack size")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch allows userspace to do BPF_MAP_LOOKUP_ELEM on
BPF_MAP_TYPE_PROG_ARRAY,
BPF_MAP_TYPE_ARRAY_OF_MAPS and
BPF_MAP_TYPE_HASH_OF_MAPS.
The lookup returns a prog-id or map-id to the userspace.
The userspace can then use the BPF_PROG_GET_FD_BY_ID
or BPF_MAP_GET_FD_BY_ID to get a fd.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When doing the following command:
# echo ":mod:kvm_intel" > /sys/kernel/tracing/stack_trace_filter
it triggered a crash.
This happened with the clean up of probes. It required all callers to the
regex function (doing ftrace filtering) to have ops->private be a pointer to
a trace_array. But for the stack tracer, that is not the case.
Allow for the ops->private to be NULL, and change the function command
callbacks to handle the trace_array pointer being NULL as well.
Fixes: d2afd57a4b96 ("tracing/ftrace: Allow instances to have their own function probes")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Stephen reported the following build warning in UP:
kernel/sched/fair.c:2657:9: warning: 'struct sched_domain' declared inside
parameter list
^
/home/sfr/next/next/kernel/sched/fair.c:2657:9: warning: its scope is only this
definition or declaration, which is probably not what you want
Hide the numa_wake_affine() inline stub on UP builds to get rid of it.
Fixes: 3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
|
|
The timers cpu base lock could not be converted to a raw spinlock becaue
the lock held time was non-deterministic due to cascading and long lasting
timer wheel traversals.
The rework of the timer wheel to the new non-cascading model removed also
the wheel traversals and the lock held times are deterministic now. This
allows to make the lock raw and thereby unbreaks NOHz* on preempt-RT.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20170627161538.30257-1-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Currently, cgroup only supports delegation to !root users and cgroup
namespaces don't get any special treatments. This limits the
usefulness of cgroup namespaces as they by themselves can't be safe
delegation boundaries. A process inside a cgroup can change the
resource control knobs of the parent in the namespace root and may
move processes in and out of the namespace if cgroups outside its
namespace are visible somehow.
This patch adds a new mount option "nsdelegate" which makes cgroup
namespaces delegation boundaries. If set, cgroup behaves as if write
permission based delegation took place at namespace boundaries -
writes to the resource control knobs from the namespace root are
denied and migration crossing the namespace boundary aren't allowed
from inside the namespace.
This allows cgroup namespace to function as a delegation boundary by
itself.
v2: Silently ignore nsdelegate specified on !init mounts.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aravind Anbudurai <aru7@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Eric Biederman <ebiederm@xmission.com>
|
|
Restructure cgroup_procs_write_permission() to make extending
permission logic easier.
This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Init boot up functions may be traced, but they are also freed when the
kernel finishes booting. These are removed from the ftrace tables, and the
debug variable for dyn_ftrace_total_info needs to reflect that as well.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
If the new_hash fails to allocate, then unlock the hash mutex on error.
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Earlier patches introduced ability to record the tgid using the 'record-tgid'
option. Here we read the tgid and output it if the option is enabled.
Link: http://lkml.kernel.org/r/20170626053844.5746-3-joelaf@google.com
Cc: kernel-team@android.com
Cc: Ingo Molnar <mingo@redhat.com>
Tested-by: Michael Sartain <mikesart@gmail.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Inorder to support recording of tgid, the following changes are made:
* Introduce a new API (tracing_record_taskinfo) to additionally record the tgid
along with the task's comm at the same time. This has has the benefit of not
setting trace_cmdline_save before all the information for a task is saved.
* Add a new API tracing_record_taskinfo_sched_switch to record task information
for 2 tasks at a time (previous and next) and use it from sched_switch probe.
* Preserve the old API (tracing_record_cmdline) and create it as a wrapper
around the new one so that existing callers aren't affected.
* Reuse the existing sched_switch and sched_wakeup probes to record tgid
information and add a new option 'record-tgid' to enable recording of tgid
When record-tgid option isn't enabled to being with, we take care to make sure
that there's isn't memory or runtime overhead.
Link: http://lkml.kernel.org/r/20170627020155.5139-1-joelaf@google.com
Cc: kernel-team@android.com
Cc: Ingo Molnar <mingo@redhat.com>
Tested-by: Michael Sartain <mikesart@gmail.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
The dyn_ftrace_total_info file is used to show how many functions have been
converted into nops and can be used by ftrace. The problem is that it does
not get decremented when functions are removed (init boot code being freed,
and modules being freed). That means the number is very inaccurate everytime
functions are removed from the ftrace tables. Decrement it when functions
are removed.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
ftrace_arch_read_dyn_info() was used so that archs could add its own debug
information into the dyn_ftrace_total_info in the tracefs file system. That
file is for debugging usage of dynamic ftrace. No arch uses that function
anymore, so just get rid of it.
This also allows for tracing_read_dyn_info() to be cleaned up a bit.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
The first parameter of swsusp_alloc is not used, so drop it.
Signed-off-by: BaoJun Luo <baojun.luo@samsung.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Kbuild reported a build failure when CONFIG_STRICT_KERNEL_RWX was
enabled on powerpc. We don't yet have ARCH_HAS_SET_MEMORY and ppc32
saw a build failure.
I've only done a basic compile test with a config that has
hibernation enabled.
Fixes: 50327ddfbc92 (kernel/power/snapshot.c: use set_memory.h header)
Reported-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
This switches the seccomp usage tracking from atomic_t to refcount_t to
gain refcount overflow protections.
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: David Windsor <dwindsor@gmail.com>
Cc: Hans Liljestrand <hans.liljestrand@aalto.fi>
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
This just cleans up the core dumping logic to avoid the braces around
the RET_KILL case.
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
When a module filter is added to set_ftrace_filter, if the module is not
loaded, it is cached. This should be considered an active filter, and
function tracing should be filtered by this. That is, if a cached module
filter is the only filter set, then no function tracing should be happening,
as all the functions available will be filtered out.
This makes sense, as the reason to add a cached module filter, is to trace
the module when you load it. There shouldn't be any other tracing happening
until then.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
If a module is cached in the set_ftrace_filter, and that module is loaded,
then enable tracing on that module as if the cached module text was written
into set_ftrace_filter just as the module is loaded.
# echo ":mod:kvm_intel" >
# cat /sys/kernel/tracing/set_ftrace_filter
#### all functions enabled ####
:mod:kvm_intel
# modprobe kvm_intel
# cat /sys/kernel/tracing/set_ftrace_filter
vmx_get_rflags [kvm_intel]
vmx_get_pkru [kvm_intel]
vmx_get_interrupt_shadow [kvm_intel]
vmx_rdtscp_supported [kvm_intel]
vmx_invpcid_supported [kvm_intel]
[..]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
When writing in a module filter into set_ftrace_filter for a module that is
not yet loaded, it it cached, and will be executed when the module is loaded
(although that is not implemented yet at this commit). Display the list of
cached modules to be traced.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
This is the start of the infrastructure work to allow for tracing module
functions before it is loaded.
Currently the following command:
# echo :mod:some-mod > set_ftrace_filter
will enable tracing of all functions within the module "some-mod" if it is
loaded. What we want, is if the module is not loaded, that line will be
saved. When the module is loaded, then the "some-mod" will have that line
executed on it, so that the functions within it starts being traced.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Check irq state in enable/disable/unmask/mask_irq to avoid unnecessary
low level irq function calls.
This has two advantages:
- Conditionals are faster than hardware access
- Solves issues with the underlying refcounting of the pinctrl
infrastructure
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: tfiga@chromium.org
Cc: briannorris@chromium.org
Cc: dianders@chromium.org
Link: http://lkml.kernel.org/r/1498476814-12563-2-git-send-email-jeffy.chen@rock-chips.com
|
|
The irq default state is set to disabled when allocating irq desc, but the
masked state flag is not set. This is inconsistent vs. the state tracking
logic which is used to prevent unnecessary calls to hardware level irq chip
functions.
Set the masked state flag as well.
Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: tfiga@chromium.org
Cc: briannorris@chromium.org
Cc: dianders@chromium.org
Link: http://lkml.kernel.org/r/1498476814-12563-1-git-send-email-jeffy.chen@rock-chips.com
|
|
These apis only need to be defined if CONFIG_COMPAT is
enabled.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
As we change the user space type for the timerfd and posix timer
functions to newer data types, we need some form of conversion
helpers to avoid duplicating that logic.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Add helper functions to convert between struct timespec64 and
struct timespec at userspace boundaries.
This is a preparatory patch to use timespec64 as the basic type
internally in the kernel as timespec is not y2038 safe on 32 bit systems.
The patch helps the cause by containing all data conversions at the
userspace boundaries within these functions.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"A few fixes for timekeeping and timers:
- Plug a subtle race due to a missing READ_ONCE() in the timekeeping
code where reloading of a pointer results in an inconsistent
callback argument being supplied to the clocksource->read function.
- Correct the CLOCK_MONOTONIC_RAW sub-nanosecond accounting in the
time keeping core code, to prevent a possible discontuity.
- Apply a similar fix to the arm64 vdso clock_gettime()
implementation
- Add missing includes to clocksource drivers, which relied on
indirect includes which fails in certain configs.
- Use the proper iomem pointer for read/iounmap in a probe function"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arm64/vdso: Fix nsec handling for CLOCK_MONOTONIC_RAW
time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting
time: Fix clock->read(clock) race around clocksource changes
clocksource: Explicitly include linux/clocksource.h when needed
clocksource/drivers/arm_arch_timer: Fix read and iounmap of incorrect variable
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"Three fixlets for perf:
- Return the proper error code if aux buffers for a event are not
supported.
- Calculate the probe offset for inlined functions correctly
- Update the Skylake DTLB load/store miss event so it can count 1G
TLB entries as well"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf probe: Fix probe definition for inlined functions
perf/x86/intel: Add 1G DTLB load/store miss support for SKL
perf/aux: Correct return code of rb_alloc_aux() if !has_aux(ev)
|
|
time
An interrupt behaves with a burst of activity with periodic interval of time
followed by one or two peaks of longer interval.
As the time intervals are periodic, statistically speaking they follow a normal
distribution and each interrupts can be tracked individually.
Add a mechanism to compute the statistics on all interrupts, except the
timers which are deterministic from a prediction point of view, as their
expiry time is known.
The goal is to extract the periodicity for each interrupt, with the last
timestamp and sum them, so the next event can be predicted to a certain
extent.
Taking the earliest prediction gives the expected wakeup on the system
(assuming a timer won't expire before).
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: "Rafael J . Wysocki" <rafael@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Link: http://lkml.kernel.org/r/1498227072-5980-2-git-send-email-daniel.lezcano@linaro.org
|
|
The interrupt framework gives a lot of information about each interrupt. It
does not keep track of when those interrupts occur though, which is a
prerequisite for estimating the next interrupt arrival for power management
purposes.
Add a mechanism to record the timestamp for each interrupt occurrences in a
per-CPU circular buffer to help with the prediction of the next occurrence
using a statistical model.
Each CPU can store up to IRQ_TIMINGS_SIZE events <irq, timestamp>, the
current value of IRQ_TIMINGS_SIZE is 32.
Each event is encoded into a single u64, where the high 48 bits are used
for the timestamp and the low 16 bits are for the irq number.
A static key is introduced so when the irq prediction is switched off at
runtime, the overhead is near to zero.
It results in most of the code in internals.h for inline reasons and a very
few in the new file timings.c. The latter will contain more in the next patch
which will provide the statistical model for the next event prediction.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: "Rafael J . Wysocki" <rafael@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Link: http://lkml.kernel.org/r/1498227072-5980-1-git-send-email-daniel.lezcano@linaro.org
|
|
debugfs_remove() has it's own NULL pointer check. Remove the conditional
and make irq_remove_debugfs_entry() an inline helper
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull timer fix from Eric Biederman:
"This fixes an issue of confusing injected signals with the signals
from posix timers that has existed since posix timers have been in the
kernel.
This patch is slightly simpler than my earlier version of this patch
as I discovered in testing that I had misspelled "#ifdef
CONFIG_POSIX_TIMERS". So I deleted that unnecessary test and made
setting of resched_timer uncondtional.
I have tested this and verified that without this patch there is a
nasty hang that is easy to trigger, and with this patch everything
works properly"
Thomas Gleixner dixit:
"It fixes the problem at hand and covers the ptrace case as well, which
I missed.
Reviewed-and-tested-by: Thomas Gleixner <tglx@linutronix.de>"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Only reschedule timers on signals timers have sent
|
|
The effective_load() function was only used by the NUMA balancing
code, and not by the regular load balancing code. Now that the
NUMA balancing code no longer uses it either, get rid of it.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-5-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Since select_idle_sibling() can place a task anywhere on a socket,
comparing loads between individual CPU cores makes no real sense
for deciding whether to do an affine wakeup across sockets, either.
Instead, compare the load between the sockets in a similar way the
load balancer and the numa balancing code do.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-4-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Then 'this_cpu' and 'prev_cpu' are in the same socket, select_idle_sibling()
will do its thing regardless of the return value of wake_affine().
Just return true and don't look at all the other things.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jhladky@redhat.com
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-3-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Several tests in the NAS benchmark seem to run a lot slower with
NUMA balancing enabled, than with NUMA balancing disabled. The
slower run time corresponds with increased idle time.
Overriding the final test of migrate_degrades_locality (but still
doing the other NUMA tests first) seems to improve performance
of those benchmarks.
Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20170623165530.22514-2-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Commit 31fd85816dbe ("bpf: permits narrower load from bpf program
context fields") permits narrower load for certain ctx fields.
The commit however will already generate a masking even if
the prog-specific ctx conversion produces the result with
narrower size.
For example, for __sk_buff->protocol, the ctx conversion
loads the data into register with 2-byte load.
A narrower 2-byte load should not generate masking.
For __sk_buff->vlan_present, the conversion function
set the result as either 0 or 1, essentially a byte.
The narrower 2-byte or 1-byte load should not generate masking.
To avoid unnecessary masking, prog-specific *_is_valid_access
now passes converted_op_size back to verifier, which indicates
the valid data width after perceived future conversion.
Based on this information, verifier is able to avoid
unnecessary marking.
Since we want more information back from prog-specific
*_is_valid_access checking, all of them are packed into
one data structure for more clarity.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This helps making sched/core.c smaller and hopefully easier to understand and maintain.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170621182203.30626-3-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This helps making sched/core.c smaller and hopefully easier to understand and maintain.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170621182203.30626-2-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Make CONFIG_CPUSETS=y depend on SMP as this feature makes no sense
on UP. This allows for configuring out cpuset_cpumask_can_shrink()
and task_can_attach() entirely, which shrinks the kernel a bit.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170614171926.8345-2-nicolas.pitre@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Currently, when a function is not found in kallsyms, instead of simply
showing the function address, it shows nothing at all:
# echo ':mod:kvm_intel' > /sys/kernel/tracing/set_ftrace_filter
# echo function > /sys/kernel/tracing/set_ftrace_filter
# qemu -enable-kvm /home/my-qemu-image
<Ctrl-C>
# rmmod kvm_intel
# cat /sys/kernel/tracing/trace
qemu-system-x86-2408 [001] d..2 135.013238: <-kvm_arch_hardware_enable
qemu-system-x86-2408 [001] .... 135.014574: <-kvm_arch_vm_ioctl
qemu-system-x86-2408 [001] .... 135.015420: <-kvm_vm_ioctl_check_extension
qemu-system-x86-2408 [001] .... 135.045411: <-__do_cpuid_ent
qemu-system-x86-2408 [001] .... 135.045412: <-__do_cpuid_ent
qemu-system-x86-2408 [001] .... 135.045412: <-__do_cpuid_ent
qemu-system-x86-2408 [001] .... 135.045412: <-__do_cpuid_ent
qemu-system-x86-2408 [001] ...1 135.045413: <-__do_cpuid_ent
qemu-system-x86-2408 [001] .... 135.045413: <-__do_cpuid_ent
When it should show:
qemu-system-x86-2408 [001] d..2 135.013238: 0xffffffffa02a39f0 <-kvm_arch_hardware_enable
qemu-system-x86-2408 [001] .... 135.014574: 0xffffffffa02a2ba0 <-kvm_arch_vm_ioctl
qemu-system-x86-2408 [001] .... 135.015420: 0xffffffffa029e4e0 <-kvm_vm_ioctl_check_extension
qemu-system-x86-2408 [001] .... 135.045411: 0xffffffffa02a1380 <-__do_cpuid_ent
qemu-system-x86-2408 [001] .... 135.045412: 0xffffffffa029e160 <-__do_cpuid_ent
qemu-system-x86-2408 [001] .... 135.045412: 0xffffffffa029e180 <-__do_cpuid_ent
qemu-system-x86-2408 [001] .... 135.045412: 0xffffffffa029e520 <-__do_cpuid_ent
qemu-system-x86-2408 [001] ...1 135.045413: 0xffffffffa02a13b0 <-__do_cpuid_ent
qemu-system-x86-2408 [001] .... 135.045413: 0xffffffffa02a1380 <-__do_cpuid_ent
instead.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
It did seem like a good idea at the time, but it never really
caught on, and auto-recursive domains remain unused 3 years after
having been introduced.
Oh well, time for a late spring cleanup.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
We can have irq domains that are identified by the same fwnode
(because they are serviced by the same HW), and yet have different
functionnality (because they serve different busses, for example).
This is what we use the bus_token field.
Since we don't use this field when generating the domain name,
all the aliasing domains will get the same name, and the debugfs
file creation fails. Also, bus_token is updated by individual drivers,
and the core code is unaware of that update.
In order to sort this mess, let's introduce a helper that takes care
of updating bus_token, and regenerate the debugfs file.
A separate patch will update all the individual users.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Currently the irq vector spread algorithm is restricted to online CPUs,
which ties the IRQ mapping to the currently online devices and doesn't deal
nicely with the fact that CPUs could come and go rapidly due to e.g. power
management.
Instead assign vectors to all present CPUs to avoid this churn.
Build a map of all possible CPUs for a given node, as the architectures
only provide a map of all onlines CPUs. Do this dynamically on each call
for the vector assingments, which is a bit suboptimal and could be
optimized in the future by provinding a mapping from the arch code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linux-nvme@lists.infradead.org
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170603140403.27379-5-hch@lst.de
|
|
Avoid trying to add a newly online CPU to the effective affinity mask of an
started up interrupt. That interrupt will either stay on the already online
CPU or move around for no value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235447.431321047@linutronix.de
|
|
Many interrupt chips allow only a single CPU as interrupt target. The core
code has no knowledge about that. That's unfortunate as it could avoid
trying to readd a newly online CPU to the effective affinity mask.
Add the status flag and the necessary accessors.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235447.352343969@linutronix.de
|
|
If a CPU goes offline, interrupts affine to the CPU are moved away. If the
outgoing CPU is the last CPU in the affinity mask the migration code breaks
the affinity and sets it it all online cpus.
This is a problem for affinity managed interrupts as CPU hotplug is often
used for power management purposes. If the affinity is broken, the
interrupt is not longer affine to the CPUs to which it was allocated.
The affinity spreading allows to lay out multi queue devices in a way that
they are assigned to a single CPU or a group of CPUs. If the last CPU goes
offline, then the queue is not longer used, so the interrupt can be
shutdown gracefully and parked until one of the assigned CPUs comes online
again.
Add a graceful shutdown mechanism into the irq affinity breaking code path,
mark the irq as MANAGED_SHUTDOWN and leave the affinity mask unmodified.
In the online path, scan the active interrupts for managed interrupts and
if the interrupt is functional and the newly online CPU is part of the
affinity mask, restart the interrupt if it is marked MANAGED_SHUTDOWN or if
the interrupts is started up, try to add the CPU back to the effective
affinity mask.
Originally-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170619235447.273417334@linutronix.de
|
|
Affinity managed interrupts should keep their assigned affinity accross CPU
hotplug. To avoid magic hackery in device drivers, the core code shall
manage them transparently and set these interrupts into a managed shutdown
state when the last CPU of the assigned affinity mask goes offline. The
interrupt will be restarted when one of the CPUs in the assigned affinity
mask comes back online.
Add the necessary logic to irq_startup(). If an interrupt is requested and
started up, the code checks whether it is affinity managed and if so, it
checks whether a CPU in the interrupts affinity mask is online. If not, it
puts the interrupt into managed shutdown state.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235447.189851170@linutronix.de
|
|
In order to handle managed interrupts gracefully on irq_startup() so they
won't lose their assigned affinity, it's necessary to allow startups which
keep the interrupts in managed shutdown state, if none of the assigend CPUs
is online. This allows drivers to request interrupts w/o the CPUs being
online, which avoid online/offline churn in drivers.
Add a force argument which can override that decision and let only
request_irq() and enable_irq() allow the managed shutdown
handling. enable_irq() is required, because the interrupt might be
requested with IRQF_NOAUTOEN and enable_irq() invokes irq_startup() which
would then wreckage the assignment again. All other callers force startup
and potentially break the assigned affinity.
No functional change as this only adds the function argument.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235447.112094565@linutronix.de
|