summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2015-04-08tracing: Update trace-event-sample with TRACE_SYSTEM_VAR documentationSteven Rostedt (Red Hat)
Add documentation about TRACE_SYSTEM needing to be alpha-numeric or with underscores, and that if it is not, then the use of TRACE_SYSTEM_VAR is required to make something that is. An example of this is shown in samples/trace_events/trace-events-sample.h Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08tracing: Give system name a pointerSteven Rostedt (Red Hat)
Normally the compiler will use the same pointer for a string throughout the file. But there's no guarantee of that happening. Later changes will require that all events have the same pointer to the system string. Name the system string and have all events point to it. Testing this, it did not increases the size of the text, except for the notes section, which should not harm the real size any. Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08brcmsmac: Move each system tracepoints to their own headerSteven Rostedt (Red Hat)
Every tracing file must have its own TRACE_SYSTEM defined. The brcmsmac tracepoint header broke this and added in the middle of the file: #undef TRACE_SYSTEM #define TRACE_SYSTEM brcmsmac #undef TRACE_SYSTEM #define TRACE_SYSTEM brcmsmac_tx #undef TRACE_SYSTEM #define TRACE_SYSTEM brcmsmac_msg Unfortunately, this broke new code in the ftrace infrastructure. Moving each of these TRACE_SYSTEMs into their own trace file with just one TRACE_SYSTEM per file fixes the issue. Link: http://lkml.kernel.org/r/5524D99C.1050902@broadcom.com Acked-by: Arend van Spriel <arend@broadcom.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08iwlwifi: Move each system tracepoints to their own headerSteven Rostedt (Red Hat)
Every tracing file must have its own TRACE_SYSTEM defined. The iwlwifi tracepoint header broke this and added in the middle of the file: #undef TRACE_SYSTEM #define TRACE_SYSTEM iwlwifi_io #undef TRACE_SYSTEM #define TRACE_SYSTEM iwlwifi_ucode #undef TRACE_SYSTEM #define TRACE_SYSTEM iwlwifi_msg #undef TRACE_SYSTEM #define TRACE_SYSTEM iwlwifi_data #undef TRACE_SYSTEM #define TRACE_SYSTEM iwlwifi Unfortunately, this broke new code in the ftrace infrastructure. Moving each of these TRACE_SYSTEMs into their own trace file with just one TRACE_SYSTEM per file fixes the issue. Link: http://lkml.kernel.org/r/1428479094.2809.3.camel@sipsolutions.net Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08perf tools: Merge all perf_event_attr print functionsPeter Zijlstra
Currently there's 3 (that I found) different and incomplete implementations of printing perf_event_attr. This is quite silly. Merge the lot. While this patch does not retain the exact form all printing that I found is debug output and thus it should not be critical. Also, I cannot find a single print_event_desc() caller. Pre: $ perf record -vv -e cycles -- sleep 1 ------------------------------------------------------------ perf_event_attr: type 0 size 104 config 0 sample_period 4000 sample_freq 4000 sample_type 0x107 read_format 0 disabled 1 inherit 1 pinned 0 exclusive 0 exclude_user 0 exclude_kernel 0 exclude_hv 0 exclude_idle 0 mmap 1 comm 1 mmap2 1 comm_exec 1 freq 1 inherit_stat 0 enable_on_exec 1 task 1 watermark 0 precise_ip 0 mmap_data 0 sample_id_all 1 exclude_host 0 exclude_guest 1 excl.callchain_kern 0 excl.callchain_user 0 wakeup_events 0 wakeup_watermark 0 bp_type 0 bp_addr 0 config1 0 bp_len 0 config2 0 branch_sample_type 0 sample_regs_user 0 sample_stack_user 0 sample_regs_intr 0 ------------------------------------------------------------ $ perf evlist -vv cycles: sample_freq=4000, size: 104, sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, mmap2: 1, comm: 1, comm_exec: 1, freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1 Post: $ ./perf record -vv -e cycles -- sleep 1 ------------------------------------------------------------ perf_event_attr: size 112 { sample_period, sample_freq } 4000 sample_type IP|TID|TIME|PERIOD disabled 1 inherit 1 mmap 1 comm 1 freq 1 enable_on_exec 1 task 1 sample_id_all 1 exclude_guest 1 mmap2 1 comm_exec 1 ------------------------------------------------------------ $ ./perf evlist -vv cycles: size: 112, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1 Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Ahern <dsahern@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150407091150.644238729@infradead.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08perf record: Add clockid parameterPeter Zijlstra
Teach perf-record about the new perf_event_attr::{use_clockid, clockid} fields. Add a simple parameter to set the clock (if any) to be used for the events to be recorded into the data file. Since we store the entire perf_event_attr in the EVENT_DESC section we also already store the used clockid in the data file. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: David Ahern <dsahern@gmail.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yunlong Song <yunlong.song@huawei.com> Link: http://lkml.kernel.org/r/20150407154851.GR23123@twins.programming.kicks-ass.net [ Conditionally define CLOCK_BOOTTIME, at least rhel6 doesn't have it - dsahern Ditto for CLOCK_MONOTONIC_RAW, sles11sp2 doesn't have it - yunlong.song ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08perf sched replay: Use replay_repeat to calculate the runavg of cpu usage ↵Yunlong Song
instead of the default value 10 Since sched->replay_repeat is set to 10 as default, the sched->run_avg, sched->runavg_cpu_usage, and sched->runavg_parent_cpu_usage all use 10 to calculate their value. However, the replay_repeat can be changed to other value by using -r option, so the calculation above should use replay_repeat to achieve more accurate results instead of the default value 10. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-10-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08perf sched replay: Support using -f to override perf.data file ownershipYunlong Song
Enable to use perf.data when it is not owned by current user or root. Example: $ ls -al perf.data -rw------- 1 Yunlong.Song Yunlong.Song 5321918 Mar 25 15:14 perf.data $ sudo id uid=0(root) gid=0(root) groups=0(root),64(pkcs11) Before this patch: $ sudo perf sched replay -f run measurement overhead: 98 nsecs sleep measurement overhead: 52909 nsecs the run test took 1000015 nsecs the sleep test took 1054253 nsecs File perf.data not owned by current user or root (use -f to override) As shown above, the -f option does not work at all. After this patch: $ sudo perf sched replay -f run measurement overhead: 221 nsecs sleep measurement overhead: 40514 nsecs the run test took 1000003 nsecs the sleep test took 1056098 nsecs nr_run_events: 10 nr_sleep_events: 1562 nr_wakeup_events: 5 task 0 ( :1: 1), nr_events: 1 task 1 ( :2: 2), nr_events: 1 task 2 ( :3: 3), nr_events: 1 ... ... task 1549 ( :163132: 163132), nr_events: 1 task 1550 ( :163540: 163540), nr_events: 1 task 1551 ( <unknown>: 0), nr_events: 10 ------------------------------------------------------------ #1 : 50.198, ravg: 50.20, cpu: 2335.18 / 2335.18 #2 : 219.099, ravg: 67.09, cpu: 2835.11 / 2385.17 #3 : 238.626, ravg: 84.24, cpu: 3278.26 / 2474.48 #4 : 200.364, ravg: 95.85, cpu: 2977.41 / 2524.77 #5 : 176.882, ravg: 103.96, cpu: 2801.35 / 2552.43 #6 : 191.093, ravg: 112.67, cpu: 2813.70 / 2578.56 #7 : 189.448, ravg: 120.35, cpu: 2809.21 / 2601.62 #8 : 200.637, ravg: 128.38, cpu: 2849.91 / 2626.45 #9 : 248.338, ravg: 140.37, cpu: 4380.61 / 2801.87 #10 : 511.139, ravg: 177.45, cpu: 3077.73 / 2829.45 As shown above, the -f option really works now. Besides for replay, -f option can also work for latency and map. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-9-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08perf sched replay: Fix the EMFILE error caused by the limitation of the ↵Yunlong Song
maximum open files The soft maximum number of open files for a calling process is 1024, which is defined as INR_OPEN_CUR in include/uapi/linux/fs.h, and the hard maximum number of open files for a calling process is 4096, which is defined as INR_OPEN_MAX in include/uapi/linux/fs.h. Both INR_OPEN_CUR and INR_OPEN_MAX are used to limit the value of RLIMIT_NOFILE in include/asm-generic/resource.h. And the soft maximum number finally decides the limitation of the maximum files which are allowed to be opened. That is to say a process can use at most 1024 file descriptors for its o pened files, or an EMFILE error will happen. This error can be fixed by increasing the soft maximum number, under the constraint that the soft maximum number can not exceed the hard maximum number, or both soft and hard maximum number should be increased simultaneously with privilege. For perf sched replay, it uses sys_perf_event_open to create the file descriptor for each of the tasks in order to handle information of perf events. That is to say each task needs a unique file descriptor. In x86_64, there may be over 1024 or 4096 tasks correspoinding to the record in perf.data, which causes that no enough file descriptors can be used. As a result, EMFILE error happens and stops the replay process. To solve this problem, we adaptively increase the soft and hard maximum number of open files with a '-f' option. Example: Test environment: x86_64 with 160 cores $ cat /proc/sys/kernel/pid_max 163840 $ cat /proc/sys/fs/file-max 6815744 $ ulimit -Sn 1024 $ ulimit -Hn 4096 Before this patch: $ perf sched replay ... task 1549 ( :163132: 163132), nr_events: 1 task 1550 ( :163540: 163540), nr_events: 1 task 1551 ( <unknown>: 0), nr_events: 10 Error: sys_perf_event_open() syscall returned with -1 (Too many open files) After this patch: $ perf sched replay ... task 1549 ( :163132: 163132), nr_events: 1 task 1550 ( :163540: 163540), nr_events: 1 task 1551 ( <unknown>: 0), nr_events: 10 Error: sys_perf_event_open() syscall returned with -1 (Too many open files) Have a try with -f option $ perf sched replay -f ... task 1549 ( :163132: 163132), nr_events: 1 task 1550 ( :163540: 163540), nr_events: 1 task 1551 ( <unknown>: 0), nr_events: 10 ------------------------------------------------------------ #1 : 54.401, ravg: 54.40, cpu: 3285.21 / 3285.21 #2 : 199.548, ravg: 68.92, cpu: 4999.65 / 3456.66 #3 : 170.483, ravg: 79.07, cpu: 1349.94 / 3245.99 #4 : 192.034, ravg: 90.37, cpu: 1322.88 / 3053.67 #5 : 182.929, ravg: 99.62, cpu: 1406.51 / 2888.96 #6 : 152.974, ravg: 104.96, cpu: 1167.54 / 2716.82 #7 : 155.579, ravg: 110.02, cpu: 2992.53 / 2744.39 #8 : 130.557, ravg: 112.08, cpu: 1126.43 / 2582.59 #9 : 138.520, ravg: 114.72, cpu: 1253.22 / 2449.65 #10 : 134.328, ravg: 116.68, cpu: 1587.95 / 2363.48 Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-8-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08perf sched replay: Handle the dead halt of sem_wait when create_tasks() ↵Yunlong Song
fails for any task Since there is sem_wait for each task in the wait_for_tasks(), e.g. sem_wait(&task->work_done_sem). The sem_wait can continue only when work_done_sem is greater than 0, or it will be blocked. For perf sched replay, one task may sem_post the work_done_sem of another task, which causes the work_done_sem of that task processed in a reasonable sequence, e.g. sem_post, sem_wait, sem_wait, sem_post... This sequence simulates the sched process of the running tasks at the time when perf sched record runs. As a result, all the tasks are required and their threads must be successfully created. If any one (task A) of the tasks fails to create its thread, then another task (task B), whose work_done_sem needs sem_post from that failed task A, may likely block itself due to seg_wait. And this is a dead halt, since task B's thread_func cannot continue at all. To solve this problem, perf sched replay should exit once any task fails to create its thread. Example: Test environment: x86_64 with 160 cores Before this patch: $ perf sched replay ... Error: sys_perf_event_open() syscall returned with -1 (Too many open files) ------------------------------------------------------------ <- dead halt After this patch: $ perf sched replay ... task 1551 ( <unknown>: 0), nr_events: 10 Error: sys_perf_event_open() syscall returned with -1 (Too many open files) $ As shown above, perf sched replay finishes the process after printing an error message and does not block itself. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-7-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08perf sched replay: Fix the segmentation fault problem caused by pr_err in ↵Yunlong Song
threads The pr_err in self_open_counters() prints error message to stderr. Unlike stdout, stderr uses memory buffer on the stack of each calling process. The pr_err in self_open_counters() works in a thread called thread_func created in function create_tasks, which concurrently creates sched->nr_tasks threads. If the error happens and pr_err prints the error message in each of these threads, the stack size of the perf process (default is 8192 kbytes) will quickly run out and the segmentation fault will happen then. To solve this problem, pr_err with self_open_counters() should be moved from newly created threads to the old main thread of the perf process. Then the pr_err can work in a stable situation without the strange segmentation fault problem. Example: Test environment: x86_64 with 160 cores Before this patch: $ perf sched replay ... task 1549 ( :163132: 163132), nr_events: 1 task 1550 ( :163540: 163540), nr_events: 1 task 1551 ( <unknown>: 0), nr_events: 10 Segmentation fault After this patch: $ perf sched replay ... task 1549 ( :163132: 163132), nr_events: 1 task 1550 ( :163540: 163540), nr_events: 1 task 1551 ( <unknown>: 0), nr_events: 10 ... As shown above, the result continues without any segmentation fault. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-6-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to ↵Yunlong Song
the different pid_max configurations Although the memory of pid_to_task can be allocated via calloc according to the value of /proc/sys/kernel/pid_max, it cannot handle the case when pid_max is changed after 'perf sched record' has created its perf.data. If the new pid_max configured in 'perf sched replay' is smaller than the old pid_max configured in 'perf sched record', then it will cause the assertion failure problem. To solve this problem, we realloc the memory of pid_to_task stepwise once the passed-in pid parameter in register_pid is larger than the current pid_max. Example: Test environment: x86_64 with 160 cores $ cat /proc/sys/kernel/pid_max 163840 $ perf sched record ls $ echo 5000 > /proc/sys/kernel/pid_max $ cat /proc/sys/kernel/pid_max 5000 Before this patch: $ perf sched replay run measurement overhead: 221 nsecs sleep measurement overhead: 55356 nsecs the run test took 1000011 nsecs the sleep test took 1060940 nsecs perf: builtin-sched.c:337: register_pid: Assertion `!(pid >= (unsigned long)pid_max)' failed. Aborted After this patch: $ perf sched replay run measurement overhead: 221 nsecs sleep measurement overhead: 55611 nsecs the run test took 1000026 nsecs the sleep test took 1060486 nsecs nr_run_events: 10 nr_sleep_events: 1562 nr_wakeup_events: 5 task 0 ( :1: 1), nr_events: 1 task 1 ( :2: 2), nr_events: 1 task 2 ( :3: 3), nr_events: 1 task 3 ( :5: 5), nr_events: 1 ... Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-5-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08perf sched replay: Alloc the memory of pid_to_task dynamically to adapt to ↵Yunlong Song
the unexpected change of pid_max The current memory allocation of struct task_desc *pid_to_task[MAX_PID] is in a permanent and preset way, and it has two problems: Problem 1: If the pid_max, which is the max number of pids in the system, is much smaller than MAX_PID (1024*1000), then it causes a waste of stack memory. This may happen in the case where the number of cpu cores is much smaller than 1000. Problem 2: If the pid_max is changed from the default value to a value larger than MAX_PID, then it will cause assertion failure problem. The maximum value of pid_max can be set to pid_max_max (see pidmap_init defined in kernel/pid.c), which equals to PID_MAX_LIMIT. In x86_64, PID_MAX_LIMIT is 4*1024*1024 (defined in include/linux/threads.h). This value is much larger than MAX_PID, and will take up 32768 Kbytes (4*1024*1024*8/1024) for memory allocation of pid_to_task, which is much larger than the default 8192 Kbytes of the stack size of calling process. Due to these two problems, we use calloc to allocate the memory of pid_to_task dynamically. Example: Test environment: x86_64 with 160 cores $ cat /proc/sys/kernel/pid_max 163840 $ echo 1025000 > /proc/sys/kernel/pid_max $ cat /proc/sys/kernel/pid_max 1025000 Run some applications until the pid of some process is greater than the value of MAX_PID (1024*1000). Before this patch: $ perf sched replay run measurement overhead: 221 nsecs sleep measurement overhead: 55480 nsecs the run test took 1000008 nsecs the sleep test took 1063151 nsecs perf: builtin-sched.c:330: register_pid: Assertion `!(pid >= 1024000)' failed. Aborted After this patch: $ perf sched replay run measurement overhead: 221 nsecs sleep measurement overhead: 55435 nsecs the run test took 1000004 nsecs the sleep test took 1059312 nsecs nr_run_events: 10 nr_sleep_events: 1562 nr_wakeup_events: 5 task 0 ( :1: 1), nr_events: 1 task 1 ( :2: 2), nr_events: 1 task 2 ( :3: 3), nr_events: 1 task 3 ( :5: 5), nr_events: 1 ... Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-4-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08perf sched replay: Increase the MAX_PID value to fix assertion failure problemYunlong Song
Current MAX_PID is only 65536, which will cause assertion failure problem when CPU cores are more than 64 in x86_64. This is because the pid_max value in x86_64 is at least PIDS_PER_CPU_DEFAULT * num_possible_cpus() (see function pidmap_init defined in kernel/pid.c), where PIDS_PER_CPU_DEFAULT is 1024 (defined in include/linux/threads.h). Thus for MAX_PID = 65536, the correspoinding CPU cores are 65536/1024=64. This is obviously not enough at all for x86_64, and will cause an assertion failure problem due to BUG_ON(pid >= MAX_PID) in the codes. We increase MAX_PID value from 65536 to 1024*1000, which can be used in x86_64 with 1000 cores. This number is finally decided according to the limitation of stack size of calling process. Use 'ulimit -a', the result shows the stack size of any process is 8192 Kbytes, which is defined in include/uapi/linux/resource.h (#define _STK_LIM (8*1024*1024)). Thus we choose a large enough value for MAX_PID, and make it satisfy to the limitation of the stack size, i.e., making the perf process take up a memory space just smaller than 8192 Kbytes. We have calculated and tested that 1024*1000 is OK for MAX_PID. This means perf sched replay can now be used with at most 1000 cores in x86_64 without any assertion failure problem. Example: Test environment: x86_64 with 160 cores $ cat /proc/sys/kernel/pid_max 163840 Before this patch: $ perf sched replay run measurement overhead: 240 nsecs sleep measurement overhead: 55379 nsecs the run test took 1000004 nsecs the sleep test took 1059424 nsecs perf: builtin-sched.c:330: register_pid: Assertion `!(pid >= 65536)' failed. Aborted After this patch: $ perf sched replay run measurement overhead: 221 nsecs sleep measurement overhead: 55397 nsecs the run test took 999920 nsecs the sleep test took 1053313 nsecs nr_run_events: 10 nr_sleep_events: 1562 nr_wakeup_events: 5 task 0 ( :1: 1), nr_events: 1 task 1 ( :2: 2), nr_events: 1 task 2 ( :3: 3), nr_events: 1 task 3 ( :5: 5), nr_events: 1 ... Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-3-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08perf sched replay: Use struct task_desc instead of struct task_task for ↵Yunlong Song
correct meaning There is no struct task_task at all, thus it is a typo error in the old commits, now fix it to what it should be in order to avoid unnecessary misunderstanding. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427809596-29559-2-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08perf kmem: Respect -i optionJiri Olsa
Currently the perf kmem does not respect -i option. Initializing the file.path properly after options get parsed. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1428298576-9785-2-git-send-email-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08tools lib traceevent: Honor operator priorityNamhyung Kim
Currently it ignores operator priority and just sets processed args as a right operand. But it could result in priority inversion in case that the right operand is also a operator arg and its priority is lower. For example, following print format is from new kmem events. "page=%p", REC->pfn != -1UL ? (((struct page *)(0xffffea0000000000UL)) + (REC->pfn)) : ((void *)0) But this was treated as below: REC->pfn != ((null - 1UL) ? ((struct page *)0xffffea0000000000UL + REC->pfn) : (void *) 0) In this case, the right arg was '?' operator which has lower priority. But it just sets the whole arg so making the output confusing - page was always 0 or 1 since that's the result of logical operation. With this patch, it can handle it properly like following: ((REC->pfn != (null - 1UL)) ? ((struct page *)0xffffea0000000000UL + REC->pfn) : (void *) 0) Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1428298576-9785-10-git-send-email-namhyung@kernel.org [ Replaced 'swap' with 'rotate' in a comment as requested by Steve and agreed by Namhyung ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08perf kmaps: Check kmaps to make code more robustWang Nan
This patch add checks in places where map__kmap is used to get kmaps from struct kmap. Error messages are added at map__kmap to warn invalid accessing of kmap (for the case of !map->dso->kernel, kmap(map) does not exists at all). Also, introduces map__kmaps() to warn uninitialized kmaps. Reviewed-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: pi3orama@163.com Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/r/1428394966-131044-2-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08perf evlist: Fix inverted logic in perf_mmap__emptyHe Kuang
perf_evlist__mmap_consume() uses perf_mmap__empty() to judge whether perf_mmap is empty and can be released. But the result is inverted so fix it. Signed-off-by: He Kuang <hekuang@huawei.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1428399071-7141-1-git-send-email-hekuang@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-04-08regulator: qcom: Tidy up probe()Bjorn Andersson
Tidy up error reporting and move rpm reference retrieval out of the for loop for improved readability. Signed-off-by: Bjorn Andersson <bjorn.andersson@sonymobile.com> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Mark Brown <broonie@kernel.org>
2015-04-08regulator: qcom: Rework to single platform deviceBjorn Andersson
Modeling the individual RPM resources as platform devices consumes at least 12-15kb of RAM, just to hold the platform_device structs. Rework this to instead have one device per pmic exposed by the RPM. With this representation we can more accurately define the input pins on the pmic and have the supply description match the data sheet. Suggested-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Bjorn Andersson <bjorn.andersson@sonymobile.com> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Mark Brown <broonie@kernel.org>
2015-04-08regulator: qcom: Refactor of-parsing codeBjorn Andersson
Refactor out all custom property parsing code from the probe function into a function suitable for regulator_desc->of_parse_cb usage. Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Bjorn Andersson <bjorn.andersson@sonymobile.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2015-04-08regulator: qcom: Don't enable DRMS in driverBjorn Andersson
The driver itself should not flag regulators as being DRMS compatible, this should come from board or dt files. Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Bjorn Andersson <bjorn.andersson@sonymobile.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2015-04-08regulator: max8660: fix assignment of pdata to data that becomes deadColin Ian King
pdata is assigned to &pdata_of, however, pdata_of becomes dead (when it goes out of scope) so pdata effectively becomes a dead pointer to the out of scope object. This is detected by static analysis: [drivers/regulator/max8660.c:411]: (error) Dead pointer usage. Pointer 'pdata' is dead if it has been assigned '&pdata_of' at line 404. Move declaration of pdata_of so it is always in scope. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2015-04-08spi: img-spfi: Setup TRANSACTION register before CONTROL registerSifan Naeem
Setting the transfer length in the TRANSACTION register after the CONTROL register is programmed causes intermittent timeout issues in SPFI transfers when using the SPI framework to control the CS GPIO lines. To avoid this issue, set transfer length before programming the CONTROL register. Signed-off-by: Sifan Naeem <sifan.naeem@imgtec.com> Signed-off-by: Andrew Bresticker <abrestic@chromium.org> Signed-off-by: Mark Brown <broonie@kernel.org>
2015-04-08mmc: core: Convert the error field in struct mmc_command|data into an intUlf Hansson
Everybody expects the error field in the struct mmc_command|data to be and int but it's actually an unsigned int. Let's convert it into an int to meet the expectations. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2015-04-08mmc: sdhci-of-arasan: Call OF parsing for MMCMichal Simek
Also check MMC OF properties. The controller supports MMC too. Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2015-04-08mmc: sdhci-pci: fix 64 BIT DMA quirks for rtsxMicky Ching
rts5250 chip failed handle 64 bit ADMA for address below 4G. Add 64 BIT quirks to disable this feature. Signed-off-by: Micky Ching <micky_ching@realsil.com.cn> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2015-04-08ALSA: hda/realtek - Make more stable to get pin sense for ALC283Kailang Yang
Pin sense will active when power pin is wake up. Power pin will not wake up immediately during resume state. Add some delay to wait for power pin activated. Signed-off-by: Kailang Yang <kailang@realtek.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2015-04-08kvm: mmu: lazy collapse small sptes into large sptesWanpeng Li
Dirty logging tracks sptes in 4k granularity, meaning that large sptes have to be split. If live migration is successful, the guest in the source machine will be destroyed and large sptes will be created in the destination. However, the guest continues to run in the source machine (for example if live migration fails), small sptes will remain around and cause bad performance. This patch introduce lazy collapsing of small sptes into large sptes. The rmap will be scanned in ioctl context when dirty logging is stopped, dropping those sptes which can be collapsed into a single large-page spte. Later page faults will create the large-page sptes. Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Message-Id: <1428046825-6905-1-git-send-email-wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08KVM: x86: Clear CR2 on VCPU resetNadav Amit
CR2 is not cleared as it should after reset. See Intel SDM table named "IA-32 Processor States Following Power-up, Reset, or INIT". Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Message-Id: <1427933438-12782-5-git-send-email-namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08KVM: x86: DR0-DR3 are not clear on resetNadav Amit
DR0-DR3 are not cleared as they should during reset and when they are set from userspace. It appears to be caused by c77fb5fe6f03 ("KVM: x86: Allow the guest to run with dirty debug registers"). Force their reload on these situations. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Message-Id: <1427933438-12782-4-git-send-email-namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08KVM: x86: BSP in MSR_IA32_APICBASE is writableNadav Amit
After reset, the CPU can change the BSP, which will be used upon INIT. Reset should return the BSP which QEMU asked for, and therefore handled accordingly. To quote: "If the MP protocol has completed and a BSP is chosen, subsequent INITs (either to a specific processor or system wide) do not cause the MP protocol to be repeated." [Intel SDM 8.4.2: MP Initialization Protocol Requirements and Restrictions] Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Message-Id: <1427933438-12782-3-git-send-email-namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08KVM: x86: simplify kvm_apic_mapRadim Krčmář
recalculate_apic_map() uses two passes over all VCPUs. This is a relic from time when we selected a global mode in the first pass and set up the optimized table in the second pass (to have a consistent mode). Recent changes made mixed mode unoptimized and we can do it in one pass. Format of logical MDA is a function of the mode, so we encode it in apic_logical_id() and drop obsoleted variables from the struct. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Message-Id: <1423766494-26150-5-git-send-email-rkrcmar@redhat.com> [Add lid_bits temporary in apic_logical_id. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08KVM: x86: avoid logical_map when it is invalidRadim Krčmář
We want to support mixed modes and the easiest solution is to avoid optimizing those weird and unlikely scenarios. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Message-Id: <1423766494-26150-4-git-send-email-rkrcmar@redhat.com> [Add comment above KVM_APIC_MODE_* defines. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08KVM: x86: fix mixed APIC mode broadcastRadim Krčmář
Broadcast allowed only one global APIC mode, but mixed modes are theoretically possible. x2APIC IPI doesn't mean 0xff as broadcast, the rest does. x2APIC broadcasts are accepted by xAPIC. If we take SDM to be logical, even addreses beginning with 0xff should be accepted, but real hardware disagrees. This patch aims for simple code by considering most of real behavior as undefined. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Message-Id: <1423766494-26150-3-git-send-email-rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08KVM: x86: use MDA for interrupt matchingRadim Krčmář
In mixed modes, we musn't deliver xAPIC IPIs like x2APIC and vice versa. Instead of preserving the information in apic_send_ipi(), we regain it by converting all destinations into correct MDA in the slow path. This allows easier reasoning about subsequent matching. Our kvm_apic_broadcast() had an interesting design decision: it didn't consider IOxAPIC 0xff as broadcast in x2APIC mode ... everything worked because IOxAPIC can't set that in physical mode and logical mode considered it as a message for first 8 VCPUs. This patch interprets IOxAPIC 0xff as x2APIC broadcast. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Message-Id: <1423766494-26150-2-git-send-email-rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08kvm/ppc/mpic: drop unused IRQ_testbitArseny Solokha
Drop unused static procedure which doesn't have callers within its translation unit. It had been already removed independently in QEMU[1] from the OpenPIC implementation borrowed from the kernel. [1] https://lists.gnu.org/archive/html/qemu-devel/2014-06/msg01812.html Signed-off-by: Arseny Solokha <asolokha@kb.kras.ru> Cc: Alexander Graf <agraf@suse.de> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <1424768706-23150-3-git-send-email-asolokha@kb.kras.ru> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08KVM: nVMX: remove unnecessary double caching of MAXPHYADDREugene Korenevsky
After speed-up of cpuid_maxphyaddr() it can be called frequently: instead of heavyweight enumeration of CPUID entries it returns a cached pre-computed value. It is also inlined now. So caching its result became unnecessary and can be removed. Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com> Message-Id: <20150329205644.GA1258@gnote> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08KVM: nVMX: checks for address bits beyond MAXPHYADDR on VM-entryEugene Korenevsky
On each VM-entry CPU should check the following VMCS fields for zero bits beyond physical address width: - APIC-access address - virtual-APIC address - posted-interrupt descriptor address This patch adds these checks required by Intel SDM. Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com> Message-Id: <20150329205627.GA1244@gnote> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpuEugene Korenevsky
cpuid_maxphyaddr(), which performs lot of memory accesses is called extensively across KVM, especially in nVMX code. This patch adds a cached value of maxphyaddr to vcpu.arch to reduce the pressure onto CPU cache and simplify the code of cpuid_maxphyaddr() callers. The cached value is initialized in kvm_arch_vcpu_init() and reloaded every time CPUID is updated by usermode. It is obvious that these reloads occur infrequently. Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com> Message-Id: <20150329205612.GA1223@gnote> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08KVM: vmx: pass error code with internal error #2Radim Krčmář
Exposing the on-stack error code with internal error is cheap and potentially useful. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Message-Id: <1428001865-32280-1-git-send-email-rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08x86: vdso: fix pvclock races with task migrationRadim Krčmář
If we were migrated right after __getcpu, but before reading the migration_count, we wouldn't notice that we read TSC of a different VCPU, nor that KVM's bug made pvti invalid, as only migration_count on source VCPU is increased. Change vdso instead of updating migration_count on destination. Cc: stable@vger.kernel.org Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Fixes: 0a4e6be9ca17 ("x86: kvm: Revert "remove sched notifier for cross-cpu migrations"") Message-Id: <1428000263-11892-1-git-send-email-rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08KVM: remove kvm_read_hva and kvm_read_hva_atomicPaolo Bonzini
The corresponding write functions just use __copy_to_user. Do the same on the read side. This reverts what's left of commit 86ab8cffb498 (KVM: introduce gfn_to_hva_read/kvm_read_hva/kvm_read_hva_atomic, 2012-08-21) Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <1427976500-28533-1-git-send-email-pbonzini@redhat.com>
2015-04-08KVM: x86: optimize delivery of TSC deadline timer interruptPaolo Bonzini
The newly-added tracepoint shows the following results on the tscdeadline_latency test: qemu-kvm-8387 [002] 6425.558974: kvm_vcpu_wakeup: poll time 10407 ns qemu-kvm-8387 [002] 6425.558984: kvm_vcpu_wakeup: poll time 0 ns qemu-kvm-8387 [002] 6425.561242: kvm_vcpu_wakeup: poll time 10477 ns qemu-kvm-8387 [002] 6425.561251: kvm_vcpu_wakeup: poll time 0 ns and so on. This is because we need to go through kvm_vcpu_block again after the timer IRQ is injected. Avoid it by polling once before entering kvm_vcpu_block. On my machine (Xeon E5 Sandy Bridge) this removes about 500 cycles (7%) from the latency of the TSC deadline timer. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08KVM: x86: extract blocking logic from __vcpu_runPaolo Bonzini
Rename the old __vcpu_run to vcpu_run, and extract part of it to a new function vcpu_block. The next patch will add a new condition in vcpu_block, avoid extra indentation. Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08kvm: x86: fix x86 eflags fixed bitWanpeng Li
Guest can't be booted w/ ept=0, there is a message dumped as below: If you're running a guest on an Intel machine without unrestricted mode support, the failure can be most likely due to the guest entering an invalid state for Intel VT. For example, the guest maybe running in big real mode which is not supported on less recent Intel processors. EAX=00000011 EBX=f000d2f6 ECX=00006cac EDX=000f8956 ESI=bffbdf62 EDI=00000000 EBP=00006c68 ESP=00006c68 EIP=0000d187 EFL=00000004 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =e000 000e0000 ffffffff 00809300 DPL=0 DS16 [-WA] CS =f000 000f0000 ffffffff 00809b00 DPL=0 CS16 [-RA] SS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA] DS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA] FS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA] GS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA] LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy GDT= 000f6a80 00000037 IDT= 000f6abe 00000000 CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000 DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 DR6=00000000ffff0ff0 DR7=0000000000000400 EFER=0000000000000000 Code=01 1e b8 6a 2e 0f 01 16 74 6a 0f 20 c0 66 83 c8 01 0f 22 c0 <66> ea 8f d1 0f 00 08 00 b8 10 00 00 00 8e d8 8e c0 8e d0 8e e0 8e e8 89 c8 ff e2 89 c1 b8X X86 eflags bit 1 is fixed set, which means that 1 << 1 is set instead of 1, this patch fix it. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Message-Id: <1428473294-6633-1-git-send-email-wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08cfg80211: don't allow disabling WEXT if it's requiredJohannes Berg
The change to only export WEXT symbols when required could break the build if CONFIG_CFG80211_WEXT was explicitly disabled while a driver like orinoco selected it. Fix this by hiding the symbol when it's required so it can't be disabled in that case. Fixes: 2afe38d15cee ("cfg80211-wext: export symbols only when needed") Reported-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Jim Davis <jim.epost@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-04-08x86/asm/entry/64: Add forgotten CFI annotationDenys Vlasenko
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/1428424967-14460-1-git-send-email-dvlasenk@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-08x86/asm/entry/irq: Simplify interrupt dispatch table (IDT) layoutDenys Vlasenko
Interrupt entry points are handled with the following code, each 32-byte code block contains seven entry points: ... [push][jump 22] // 4 bytes [push][jump 18] // 4 bytes [push][jump 14] // 4 bytes [push][jump 10] // 4 bytes [push][jump 6] // 4 bytes [push][jump 2] // 4 bytes [push][jump common_interrupt][padding] // 8 bytes [push][jump] [push][jump] [push][jump] [push][jump] [push][jump] [push][jump] [push][jump common_interrupt][padding] [padding_2] common_interrupt: And there is a table which holds pointers to every entry point, IOW: to every push. In cold cache, two jumps are still costlier than one, even though we get the benefit of them residing in the same cacheline. This change replaces short jumps with near ones to 'common_interrupt', and pads every push+jump pair to 8 bytes. This way, each interrupt takes only one jump. This change replaces ".p2align CONFIG_X86_L1_CACHE_SHIFT" before dispatch table with ".align 8" - we do not need anything stronger than that. The table of entry addresses (the interrupt[] array) is no longer necessary, the address of entries can be easily calculated as (irq_entries_start + i*8). text data bss dec hex filename 12546 0 0 12546 3102 entry_64.o.before 11626 0 0 11626 2d6a entry_64.o The size decrease is because 1656 bytes of .init.rodata are gone. That's initdata, though. The resident size does go up a bit. Run-tested (32 and 64 bits). Acked-and-Tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Will Drewry <wad@chromium.org> Link: http://lkml.kernel.org/r/1428090553-7283-1-git-send-email-dvlasenk@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>