git.armlinux.org.uk/linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2014-04-24	kprobes, x86: Prohibit probing on debug_stack_*()	Masami Hiramatsu
	Prohibit probing on debug_stack_reset and debug_stack_set_zero. Since the both functions are called from TRACE_IRQS_ON/OFF_DEBUG macros which run in int3 ist entry, probing it may cause a soft lockup. This happens when the kernel built with CONFIG_DYNAMIC_FTRACE=y and CONFIG_TRACE_IRQFLAGS=y. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Borislav Petkov <bp@suse.de> Cc: Jan Beulich <JBeulich@suse.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Link: http://lkml.kernel.org/r/20140417081712.26341.32994.stgit@ltc230.yrl.intra.hitachi.co.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-24	kprobes: Introduce NOKPROBE_SYMBOL() macro to maintain kprobes blacklist	Masami Hiramatsu
	Introduce NOKPROBE_SYMBOL() macro which builds a kprobes blacklist at kernel build time. The usage of this macro is similar to EXPORT_SYMBOL(), placed after the function definition: NOKPROBE_SYMBOL(function); Since this macro will inhibit inlining of static/inline functions, this patch also introduces a nokprobe_inline macro for static/inline functions. In this case, we must use NOKPROBE_SYMBOL() for the inline function caller. When CONFIG_KPROBES=y, the macro stores the given function address in the "_kprobe_blacklist" section. Since the data structures are not fully initialized by the macro (because there is no "size" information), those are re-initialized at boot time by using kallsyms. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Link: http://lkml.kernel.org/r/20140417081705.26341.96719.stgit@ltc230.yrl.intra.hitachi.co.jp Cc: Alok Kataria <akataria@vmware.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christopher Li <sparse@chrisli.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: David S. Miller <davem@davemloft.net> Cc: Jan-Simon Möller <dl9pf@gmx.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-sparse@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-24	kprobes: Prohibit probing on .entry.text code	Masami Hiramatsu
	.entry.text is a code area which is used for interrupt/syscall entries, which includes many sensitive code. Thus, it is better to prohibit probing on all of such code instead of a part of that. Since some symbols are already registered on kprobe blacklist, this also removes them from the blacklist. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: David S. Miller <davem@davemloft.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jonathan Lebon <jlebon@redhat.com> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Link: http://lkml.kernel.org/r/20140417081658.26341.57354.stgit@ltc230.yrl.intra.hitachi.co.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-24	kprobes/x86: Allow to handle reentered kprobe on single-stepping	Masami Hiramatsu
	Since the NMI handlers(e.g. perf) can interrupt in the single stepping (or preparing the single stepping, do_debug etc.), we should consider a kprobe is hit in the NMI handler. Even in that case, the kprobe is allowed to be reentered as same as the kprobes hit in kprobe handlers (KPROBE_HIT_ACTIVE or KPROBE_HIT_SSDONE). The real issue will happen when a kprobe hit while another reentered kprobe is processing (KPROBE_REENTER), because we already consumed a saved-area for the previous kprobe. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jonathan Lebon <jlebon@redhat.com> Link: http://lkml.kernel.org/r/20140417081651.26341.10593.stgit@ltc230.yrl.intra.hitachi.co.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-24	perf/x86: Fix RAPL rdmsrl_safe() usage	Stephane Eranian
	This patch fixes a bug introduced by: 24223657806a ("perf/x86/intel: Use rdmsrl_safe() when initializing RAPL PMU") The rdmsrl_safe() function returns 0 on success. The current code was failing to detect the RAPL PMU on real hardware (missing /sys/devices/power) because the return value of rdmsrl_safe() was misinterpreted. Signed-off-by: Stephane Eranian <eranian@google.com> Acked-by: Borislav Petkov <bp@suse.de> Acked-by: Venkatesh Srinivas <venkateshs@google.com> Cc: peterz@infradead.org Cc: zheng.z.yan@intel.com Link: http://lkml.kernel.org/r/20140423170418.GA12767@quad Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-23	KVM: MMU: flush tlb out of mmu lock when write-protect the sptes	Xiao Guangrong
	Now we can flush all the TLBs out of the mmu lock without TLB corruption when write-proect the sptes, it is because: - we have marked large sptes readonly instead of dropping them that means we just change the spte from writable to readonly so that we only need to care the case of changing spte from present to present (changing the spte from present to nonpresent will flush all the TLBs immediately), in other words, the only case we need to care is mmu_spte_update() - in mmu_spte_update(), we haved checked SPTE_HOST_WRITEABLE \| PTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, that means it does not depend on PT_WRITABLE_MASK anymore Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-23	KVM: MMU: flush tlb if the spte can be locklessly modified	Xiao Guangrong
	Relax the tlb flush condition since we will write-protect the spte out of mmu lock. Note lockless write-protection only marks the writable spte to readonly and the spte can be writable only if both SPTE_HOST_WRITEABLE and SPTE_MMU_WRITEABLE are set (that are tested by spte_is_locklessly_modifiable) This patch is used to avoid this kind of race: VCPU 0 VCPU 1 lockless wirte protection: set spte.w = 0 lock mmu-lock write protection the spte to sync shadow page, see spte.w = 0, then without flush tlb unlock mmu-lock !!! At this point, the shadow page can still be writable due to the corrupt tlb entry Flush all TLB Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-23	KVM: MMU: lazily drop large spte	Xiao Guangrong
	Currently, kvm zaps the large spte if write-protected is needed, the later read can fault on that spte. Actually, we can make the large spte readonly instead of making them un-present, the page fault caused by read access can be avoided The idea is from Avi: \| As I mentioned before, write-protecting a large spte is a good idea, \| since it moves some work from protect-time to fault-time, so it reduces \| jitter. This removes the need for the return value. This version has fixed the issue reported in 6b73a9606, the reason of that issue is that fast_page_fault() directly sets the readonly large spte to writable but only dirty the first page into the dirty-bitmap that means other pages are missed. Fixed it by only the normal sptes (on the PT_PAGE_TABLE_LEVEL level) can be fast fixed Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-23	KVM: MMU: properly check last spte in fast_page_fault()	Xiao Guangrong
	Using sp->role.level instead of @level since @level is not got from the page table hierarchy There is no issue in current code since the fast page fault currently only fixes the fault caused by dirty-log that is always on the last level (level = 1) This patch makes the code more readable and avoids potential issue in the further development Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-23	Revert "KVM: Simplify kvm->tlbs_dirty handling"	Xiao Guangrong
	This reverts commit 5befdc385ddb2d5ae8995ad89004529a3acf58fc. Since we will allow flush tlb out of mmu-lock in the later patch Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-23	KVM: x86: Processor mode may be determined incorrectly	Nadav Amit
	If EFER.LMA is off, cs.l does not determine execution mode. Currently, the emulation engine assumes differently. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-23	KVM: x86: IN instruction emulation should ignore REP-prefix	Nadav Amit
	The IN instruction is not be affected by REP-prefix as INS is. Therefore, the emulation should ignore the REP prefix as well. The current emulator implementation tries to perform writeback when IN instruction with REP-prefix is emulated. This causes it to perform wrong memory write or spurious #GP exception to be injected to the guest. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-23	KVM: x86: Fix CR3 reserved bits	Nadav Amit
	According to Intel specifications, PAE and non-PAE does not have any reserved bits. In long-mode, regardless to PCIDE, only the high bits (above the physical address) are reserved. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-23	KVM: x86: Fix wrong/stuck PMU when guest does not use PMI	Nadav Amit
	If a guest enables a performance counter but does not enable PMI, the hypervisor currently does not reprogram the performance counter once it overflows. As a result the host performance counter is kept with the original sampling period which was configured according to the value of the guest's counter when the counter was enabled. Such behaviour can cause very bad consequences. The most distrubing one can cause the guest not to make any progress at all, and keep exiting due to host PMI before any guest instructions is exeucted. This situation occurs when the performance counter holds a very high value when the guest enables the performance counter. As a result the host's sampling period is configured to be very short. The host then never reconfigures the sampling period and get stuck at entry->PMI->exit loop. We encountered such a scenario in our experiments. The solution is to reprogram the counter even if the guest does not use PMI. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-22	KVM: nVMX: Advertise support for interrupt acknowledgement	Bandan Das
	Some Type 1 hypervisors such as XEN won't enable VMX without it present Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-22	KVM: nVMX: Ack and write vector info to intr_info if L1 asks us to	Bandan Das
	This feature emulates the "Acknowledge interrupt on exit" behavior. We can safely emulate it for L1 to run L2 even if L0 itself has it disabled (to run L1). Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-22	KVM: nVMX: Don't advertise single context invalidation for invept	Bandan Das
	For single context invalidation, we fall through to global invalidation in handle_invept() except for one case - when the operand supplied by L1 is different from what we have in vmcs12. However, typically hypervisors will only call invept for the currently loaded eptp, so the condition will never be true. Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-22	KVM: VMX: Advance rip to after an ICEBP instruction	Huw Davies
	When entering an exception after an ICEBP, the saved instruction pointer should point to after the instruction. This fixes the bug here: https://bugs.launchpad.net/qemu/+bug/1119686 Signed-off-by: Huw Davies <huw@codeweavers.com> Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-22	Merge branch 'x86-vdso-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 vdso fix from Peter Anvin: "This is a single build fix for building with gold as opposed to GNU ld. It got queued up separately and was expected to be pushed during the merge window, but it got left behind" * 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, vdso: Make the vdso linker script compatible with Gold
2014-04-22	x86: LLVMLinux: Wrap -mno-80387 with cc-option	Behan Webster
	Wrap -mno-80387 gcc options with cc-option so they don't break clang. Signed-off-by: Behan Webster <behanw@converseincode.com> Cc: torvalds@linux-foundation.org Cc: dwmw2@infradead.org Cc: pageexec@freemail.hu Link: http://lkml.kernel.org/r/1398145227-25053-1-git-send-email-behanw@converseincode.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-21	KVM: x86: Fix CR3 and LDT sel should not be saved in TSS	Nadav Amit
	According to Intel specifications, only general purpose registers and segment selectors should be saved in the old TSS during 32-bit task-switch. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-21	ftrace/x86: Fix order of warning messages when ftrace modifies code	Petr Mladek
	The colon at the end of the printk message suggests that it should get printed before the details printed by ftrace_bug(). When touching the line, let's use the preferred pr_warn() macro as suggested by checkpatch.pl. Link: http://lkml.kernel.org/r/1392650573-3390-5-git-send-email-pmladek@suse.cz Signed-off-by: Petr Mladek <pmladek@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-04-19	Merge branch 'x86-urgent-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Ingo Molnar: "This fixes the preemption-count imbalance crash reported by Owen Kibel" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Fix CMCI preemption bugs
2014-04-19	Merge branch 'perf-urgent-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Two kernel side fixes: - an Intel uncore PMU driver potential crash fix - a kprobes/perf-call-graph interaction fix" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Use rdmsrl_safe() when initializing RAPL PMU kprobes/x86: Fix page-fault handling logic
2014-04-18	arch,x86: Convert smp_mb__*()	Peter Zijlstra
	x86 is strongly ordered and all its atomic ops imply a full barrier. Implement the two new primitives as the old ones were. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-knswsr5mldkr0w1lrdxvc81w@git.kernel.org Cc: Dave Jones <davej@redhat.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michel Lespinasse <walken@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18	perf/x86: Export perf_assign_events()	Yan, Zheng
	export perf_assign_events to allow building perf Intel uncore driver as module Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1395133004-23205-3-git-send-email-zheng.z.yan@intel.com Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: eranian@google.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18	Merge branch 'perf/urgent' into perf/core, to pick up PMU driver fixes.	Ingo Molnar
	Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18	perf/x86/intel: Use rdmsrl_safe() when initializing RAPL PMU	Venkatesh Srinivas
	CPUs which should support the RAPL counters according to Family/Model/Stepping may still issue #GP when attempting to access the RAPL MSRs. This may happen when Linux is running under KVM and we are passing-through host F/M/S data, for example. Use rdmsrl_safe to first access the RAPL_POWER_UNIT MSR; if this fails, do not attempt to use this PMU. Signed-off-by: Venkatesh Srinivas <venkateshs@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1394739386-22260-1-git-send-email-venkateshs@google.com Cc: zheng.z.yan@intel.com Cc: eranian@google.com Cc: ak@linux.intel.com Cc: linux-kernel@vger.kernel.org [ The patch also silently fixes another bug: rapl_pmu_init() didn't handle the memory alloc failure case previously. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-17	uprobes/x86: Emulate relative conditional "near" jmp's	Oleg Nesterov
	Change branch_setup_xol_ops() to simply use opc1 = OPCODE2(insn) - 0x10 if OPCODE1() == 0x0f; this matches the "short" jmp which checks the same condition. Thanks to lib/insn.c, it does the rest correctly. branch->ilen/offs are correct no matter if this jmp is "near" or "short". Reported-by: Jonathan Lebon <jlebon@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
2014-04-17	uprobes/x86: Emulate relative conditional "short" jmp's	Oleg Nesterov
	Teach branch_emulate_op() to emulate the conditional "short" jmp's which check regs->flags. Note: this doesn't support jcxz/jcexz, loope/loopz, and loopne/loopnz. They all are rel8 and thus they can't trigger the problem, but perhaps we will add the support in future just for completeness. Reported-by: Jonathan Lebon <jlebon@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
2014-04-17	uprobes/x86: Emulate relative call's	Oleg Nesterov
	See the previous "Emulate unconditional relative jmp's" which explains why we can not execute "jmp" out-of-line, the same applies to "call". Emulating of rip-relative call is trivial, we only need to additionally push the ret-address. If this fails, we execute this instruction out of line and this should trigger the trap, the probed application should die or the same insn will be restarted if a signal handler expands the stack. We do not even need ->post_xol() for this case. But there is a corner (and almost theoretical) case: another thread can expand the stack right before we execute this insn out of line. In this case it hit the same problem we are trying to solve. So we simply turn the probed insn into "call 1f; 1:" and add ->post_xol() which restores ->sp and restarts. Many thanks to Jonathan who finally found the standalone reproducer, otherwise I would never resolve the "random SIGSEGV's under systemtap" bug-report. Now that the problem is clear we can write the simplified test-case: void probe_func(void), callee(void); int failed = 1; asm ( ".text\n" ".align 4096\n" ".globl probe_func\n" "probe_func:\n" "call callee\n" "ret" ); /* * This assumes that: * * - &probe_func = 0x401000 + a_bit, aligned = 0x402000 * * - xol_vma->vm_start = TASK_SIZE_MAX - PAGE_SIZE = 0x7fffffffe000 * as xol_add_vma() asks; the 1st slot = 0x7fffffffe080 * * so we can target the non-canonical address from xol_vma using * the simple math below, 100 * 4096 is just the random offset / asm (".org . + 0x800000000000 - 0x7fffffffe080 - 5 - 1 + 100 4096\n"); void callee(void) { failed = 0; } int main(void) { probe_func(); return failed; } It SIGSEGV's if you probe "probe_func" (although this is not very reliable, randomize_va_space/etc can change the placement of xol area). Note: as Denys Vlasenko pointed out, amd and intel treat "callw" (0x66 0xe8) differently. This patch relies on lib/insn.c and thus implements the intel's behaviour: 0x66 is simply ignored. Fortunately nothing sane should ever use this insn, so we postpone the fix until we decide what should we do; emulate or not, support or not, etc. Reported-by: Jonathan Lebon <jlebon@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
2014-04-17	uprobes/x86: Emulate nop's using ops->emulate()	Oleg Nesterov
	Finally we can kill the ugly (and very limited) code in __skip_sstep(). Just change branch_setup_xol_ops() to treat "nop" as jmp to the next insn. Thanks to lib/insn.c, it is clever enough. OPCODE1() == 0x90 includes "(rep;)+ nop;" at least, and (afaics) much more. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
2014-04-17	uprobes/x86: Emulate unconditional relative jmp's	Oleg Nesterov
	Currently we always execute all insns out-of-line, including relative jmp's and call's. This assumes that even if regs->ip points to nowhere after the single-step, default_post_xol_op(UPROBE_FIX_IP) logic will update it correctly. However, this doesn't work if this regs->ip == xol_vaddr + insn_offset is not canonical. In this case CPU generates #GP and general_protection() kills the task which tries to execute this insn out-of-line. Now that we have uprobe_xol_ops we can teach uprobes to emulate these insns and solve the problem. This patch adds branch_xol_ops which has a single branch_emulate_op() hook, so far it can only handle rel8/32 relative jmp's. TODO: move ->fixup into the union along with rip_rela_target_address. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Jonathan Lebon <jlebon@redhat.com> Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
2014-04-17	uprobes/x86: Introduce sizeof_long(), cleanup adjust_ret_addr() and ↵	Oleg Nesterov
	arch_uretprobe_hijack_return_addr() 1. Add the trivial sizeof_long() helper and change other callers of is_ia32_task() to use it. TODO: is_ia32_task() is not what we actually want, TS_COMPAT does not necessarily mean 32bit. Fortunately syscall-like insns can't be probed so it actually works, but it would be better to rename and use is_ia32_frame(). 2. As Jim pointed out "ncopied" in arch_uretprobe_hijack_return_addr() and adjust_ret_addr() should be named "nleft". And in fact only the last copy_to_user() in arch_uretprobe_hijack_return_addr() actually needs to inspect the non-zero error code. TODO: adjust_ret_addr() should die. We can always calculate the value we need to write into *regs->sp, just UPROBE_FIX_CALL should record insn->length. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
2014-04-17	uprobes/x86: Teach arch_uprobe_post_xol() to restart if possible	Oleg Nesterov
	SIGILL after the failed arch_uprobe_post_xol() should only be used as a last resort, we should try to restart the probed insn if possible. Currently only adjust_ret_addr() can fail, and this can only happen if another thread unmapped our stack after we executed "call" out-of-line. Most probably the application if buggy, but even in this case it can have a handler for SIGSEGV/etc. And in theory it can be even correct and do something non-trivial with its memory. Of course we can't restart unconditionally, so arch_uprobe_post_xol() does this only if ->post_xol() returns -ERESTART even if currently this is the only possible error. default_post_xol_op(UPROBE_FIX_CALL) can always restart, but as Jim pointed out it should not forget to pop off the return address pushed by this insn executed out-of-line. Note: this is not "perfect", we do not want the extra handler_chain() after restart, but I think this is the best solution we can realistically do without too much uglifications. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
2014-04-17	uprobes/x86: Send SIGILL if arch_uprobe_post_xol() fails	Oleg Nesterov
	Currently the error from arch_uprobe_post_xol() is silently ignored. This doesn't look good and this can lead to the hard-to-debug problems. 1. Change handle_singlestep() to loudly complain and send SIGILL. Note: this only affects x86, ppc/arm can't fail. 2. Change arch_uprobe_post_xol() to call arch_uprobe_abort_xol() and avoid TF games if it is going to return an error. This can help to to analyze the problem, if nothing else we should not report ->ip = xol_slot in the core-file. Note: this means that handle_riprel_post_xol() can be called twice, but this is fine because it is idempotent. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
2014-04-17	uprobes/x86: Conditionalize the usage of handle_riprel_insn()	Oleg Nesterov
	arch_uprobe_analyze_insn() calls handle_riprel_insn() at the start, but only "0xff" and "default" cases need the UPROBE_FIX_RIP_ logic. Move the callsite into "default" case and change the "0xff" case to fall-through. We are going to add the various hooks to handle the rip-relative jmp/call instructions (and more), we need this change to enforce the fact that the new code can not conflict with is_riprel_insn() logic which, after this change, can only be used by default_xol_ops. Note: arch_uprobe_abort_xol() still calls handle_riprel_post_xol() directly. This is fine unless another _xol_ops we may add later will need to reuse "UPROBE_FIX_RIP_AX\|UPROBE_FIX_RIP_CX" bits in ->fixup. In this case we can add uprobe_xol_ops->abort() hook, which (perhaps) we will need anyway in the long term. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Jim Keniston <jkenisto@us.ibm.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
2014-04-17	uprobes/x86: Introduce uprobe_xol_ops and arch_uprobe->ops	Oleg Nesterov
	Introduce arch_uprobe->ops pointing to the "struct uprobe_xol_ops", move the current UPROBE_FIX_{RIP*,IP,CALL} code into the default set of methods and change arch_uprobe_pre/post_xol() accordingly. This way we can add the new uprobe_xol_ops's to handle the insns which need the special processing (rip-relative jmp/call at least). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Jim Keniston <jkenisto@us.ibm.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
2014-04-17	uprobes/x86: move the UPROBE_FIX_{RIP,IP,CALL} code at the end of pre/post hooks	Oleg Nesterov
	No functional changes. Preparation to simplify the review of the next change. Just reorder the code in arch_uprobe_pre/post_xol() functions so that UPROBE_FIX_{RIP_*,IP,CALL} logic goes to the end. Also change arch_uprobe_pre_xol() to use utask instead of autask, to make the code more symmetrical with arch_uprobe_post_xol(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Reviewed-by: Jim Keniston <jkenisto@us.ibm.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2014-04-17	uprobes/x86: Gather "riprel" functions together	Oleg Nesterov
	Cosmetic. Move pre_xol_rip_insn() and handle_riprel_post_xol() up to the closely related handle_riprel_insn(). This way it is simpler to read and understand this code, and this lessens the number of ifdef's. While at it, update the comment in handle_riprel_post_xol() as Jim suggested. TODO: rename them somehow to make the naming consistent. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
2014-04-17	uprobes/x86: Kill the "ia32_compat" check in handle_riprel_insn(), remove ↵	Oleg Nesterov
	"mm" arg Kill the "mm->context.ia32_compat" check in handle_riprel_insn(), if it is true insn_rip_relative() must return false. validate_insn_bits() passed "ia32_compat" as !x86_64 to insn_init(), and insn_rip_relative() checks insn->x86_64. Also, remove the no longer needed "struct mm_struct *mm" argument and the unnecessary "return" at the end. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Reviewed-by: Jim Keniston <jkenisto@us.ibm.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2014-04-17	uprobes/x86: Fold prepare_fixups() into arch_uprobe_analyze_insn()	Oleg Nesterov
	No functional changes, preparation. Shift the code from prepare_fixups() to arch_uprobe_analyze_insn() with the following modifications: - Do not call insn_get_opcode() again, it was already called by validate_insn_bits(). - Move "case 0xea" up. This way "case 0xff" can fall through to default case. - change "case 0xff" to use the nested "switch (MODRM_REG)", this way the code looks a bit simpler. - Make the comments look consistent. While at it, kill the initialization of rip_rela_target_address and ->fixups, we can rely on kzalloc(). We will add the new members into arch_uprobe, it would be better to assume that everything is zero by default. TODO: cleanup/fix the mess in validate_insn_bits() paths: - validate_insn_64bits() and validate_insn_32bits() should be unified. - "ifdef" is not used consistently; if good_insns_64 depends on CONFIG_X86_64, then probably good_insns_32 should depend on CONFIG_X86_32/EMULATION - the usage of mm->context.ia32_compat looks wrong if the task is TIF_X32. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Reviewed-by: Jim Keniston <jkenisto@us.ibm.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2014-04-17	Merge tag 'stable/for-linus-3.15-rc1-tag' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull Xen fixes from David Vrabel: "Xen regression and bug fixes for 3.15-rc1: - fix completely broken 32-bit PV guests caused by x86 refactoring 32-bit thread_info. - only enable ticketlock slow path on Xen (not bare metal) - fix two bugs with PV guests not shutting down when requested - fix a minor memory leak in xen-pciback error path" * tag 'stable/for-linus-3.15-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/manage: Poweroff forcefully if user-space is not yet up. xen/xenbus: Avoid synchronous wait on XenBus stalling shutdown/restart. xen/spinlock: Don't enable them unconditionally. xen-pciback: silence an unwanted debug printk xen: fix memory leak in __xen_pcibk_add_pci_dev() x86/xen: Fix 32-bit PV guests's usage of kernel_stack
2014-04-17	KVM: VMX: speed up wildcard MMIO EVENTFD	Michael S. Tsirkin
	With KVM, MMIO is much slower than PIO, due to the need to do page walk and emulation. But with EPT, it does not have to be: we know the address from the VMCS so if the address is unique, we can look up the eventfd directly, bypassing emulation. Unfortunately, this only works if userspace does not need to match on access length and data. The implementation adds a separate FAST_MMIO bus internally. This serves two purposes: - minimize overhead for old userspace that does not use eventfd with lengtth = 0 - minimize disruption in other code (since we don't know the length, devices on the MMIO bus only get a valid address in write, this way we don't need to touch all devices to teach them to handle an invalid length) At the moment, this optimization only has effect for EPT on x86. It will be possible to speed up MMIO for NPT and MMU using the same idea in the future. With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO. I was unable to detect any measureable slowdown to non-eventfd MMIO. Making MMIO faster is important for the upcoming virtio 1.0 which includes an MMIO signalling capability. The idea was suggested by Peter Anvin. Lots of thanks to Gleb for pre-review and suggestions. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-17	KVM: support any-length wildcard ioeventfd	Michael S. Tsirkin
	It is sometimes benefitial to ignore IO size, and only match on address. In hindsight this would have been a better default than matching length when KVM_IOEVENTFD_FLAG_DATAMATCH is not set, In particular, this kind of access can be optimized on VMX: there no need to do page lookups. This can currently be done with many ioeventfds but in a suboptimal way. However we can't change kernel/userspace ABI without risk of breaking some applications. Use len = 0 to mean "ignore length for matching" in a more optimal way. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-17	x86/efi: Save and restore FPU context around efi_calls (i386)	Ricardo Neri
	Do a complete FPU context save/restore around the EFI calls. This required as runtime EFI firmware may potentially use the FPU. This change covers only the i386 configuration. Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-04-17	x86/efi: Save and restore FPU context around efi_calls (x86_64)	Ricardo Neri
	Do a complete FPU context save/restore around the EFI calls. This required as runtime EFI firmware may potentially use the FPU. This change covers only the x86_64 configuration. Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-04-17	x86/efi: Implement a __efi_call_virt macro	Ricardo Neri
	For i386, all the EFI system runtime services functions return efi_status_t except efi_reset_system_system. Therefore, not all functions can be covered by the same macro in case the macro needs to do more than calling the function (i.e., return a value). The purpose of the __efi_call_virt macro is to be used when no return value is expected. For x86_64, this macro would not be needed as all the runtime services return u64. However, the same code is used for both x86_64 and i386. Thus, the macro __efi_call_virt is also defined to not break compilation. Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-04-17	x86, fpu: Extend the use of static_cpu_has_safe	Matt Fleming
	It may be necessary to save and restore the FPU context during EFI runtime system services calls. However, this may happen during boot and before alternatives have run. Thus, we need to use static_cpu_has_safe instead. The rationale behind the use of static_cpu_has_safe is the same as in commit 5f8c4218148822fde6ee ("x86, fpu: Use static_cpu_has_safe before alternatives") by Borislav Petkov. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Cc: Borislav Petkov <bp@suse.de>
2014-04-17	x86/efi: Delete most of the efi_call* macros	Matt Fleming
	We really only need one phys and one virt function call, and then only one assembly function to make firmware calls. Since we are not using the C type system anyway, we're not really losing much by deleting the macros apart from no longer having a check that we are passing the correct number of parameters. The lack of duplicated code seems like a worthwhile trade-off. Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Matt Fleming <matt.fleming@intel.com>