linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2014-07-11	x86, vdso: Get rid of the fake section mechanism	Andy Lutomirski
	Now that we can tolerate extra things dangling off the end of the vdso image, we can strip the vdso the old fashioned way rather than using an overcomplicated custom stripping algorithm. This is a partial reversion of: 6f121e5 x86, vdso: Reimplement vdso.so preparation in build-time C Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/50e01ed6dcc0575d20afd782f9fe98d5ee3e2d8a.1405040914.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-11	x86, vdso: Move the vvar area before the vdso text	Andy Lutomirski
	Putting the vvar area after the vdso text is rather complicated: it only works of the total length of the vdso text mapping is known at vdso link time, and the linker doesn't allow symbol addresses to depend on the sizes of non-allocatable data after the PT_LOAD segment. Moving the vvar area before the vdso text will allow is to safely map non-allocatable data after the vdso text, which is a nice simplification. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/156c78c0d93144ff1055a66493783b9e56813983.1405040914.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-11	KVM: x86: Emulator support for #UD on CPL>0	Nadav Amit
	Certain instructions (e.g., mwait and monitor) cause a #UD exception when they are executed in user mode. This is in contrast to the regular privileged instructions which cause #GP. In order not to mess with SVM interception of mwait and monitor which assumes privilege level assertions take place before interception, a flag has been added. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: x86: Emulator flag for instruction that only support 16-bit addresses ↵	Nadav Amit
	in real mode Certain instructions, such as monitor and xsave do not support big real mode and cause a #GP exception if any of the accessed bytes effective address are not within [0, 0xffff]. This patch introduces a flag to mark these instructions, including the necassary checks. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: x86: use kvm_read_guest_page for emulator accesses	Paolo Bonzini
	Emulator accesses are always done a page at a time, either by the emulator itself (for fetches) or because we need to query the MMU for address translations. Speed up these accesses by using kvm_read_guest_page and, in the case of fetches, by inlining kvm_read_guest_virt_helper and dropping the loop around kvm_read_guest_page. This final tweak saves 30-100 more clock cycles (4-10%), bringing the count (as measured by kvm-unit-tests) down to 720-1100 clock cycles on a Sandy Bridge Xeon host, compared to 2300-3200 before the whole series and 925-1700 after the first two low-hanging fruit changes. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: x86: ensure emulator fetches do not span multiple pages	Paolo Bonzini
	When the CS base is not page-aligned, the linear address of the code could get close to the page boundary (e.g. 0x...ffe) even if the EIP value is not. So we need to first linearize the address, and only then compute the number of valid bytes that can be fetched. This happens relatively often when executing real mode code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: emulate: put pointers in the fetch_cache	Paolo Bonzini
	This simplifies the code a bit, especially the overflow checks. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: emulate: avoid per-byte copying in instruction fetches	Paolo Bonzini
	We do not need a memory copying loop anymore in insn_fetch; we can use a byte-aligned pointer to access instruction fields directly from the fetch_cache. This eliminates 50-150 cycles (corresponding to a 5-10% improvement in performance) from each instruction. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: emulate: avoid repeated calls to do_insn_fetch_bytes	Paolo Bonzini
	do_insn_fetch_bytes will only be called once in a given insn_fetch and insn_fetch_arr, because in fact it will only be called at most twice for any instruction and the first call is explicit in x86_decode_insn. This observation lets us hoist the call out of the memory copying loop. It does not buy performance, because most fetches are one byte long anyway, but it prepares for the next patch. The overflow check is tricky, but correct. Because do_insn_fetch_bytes has already been called once, we know that fc->end is at least 15. So it is okay to subtract the number of bytes we want to read. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: emulate: speed up do_insn_fetch	Paolo Bonzini
	Hoist the common case up from do_insn_fetch_byte to do_insn_fetch, and prime the fetch_cache in x86_decode_insn. This helps a bit the compiler and the branch predictor, but above all it lays the ground for further changes in the next few patches. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: emulate: do not initialize memopp	Bandan Das
	rip_relative is only set if decode_modrm runs, and if you have ModRM you will also have a memopp. We can then access memopp unconditionally. Note that rip_relative cannot be hoisted up to decode_modrm, or you break "mov $0, xyz(%rip)". Also, move typecast on "out of range value" of mem.ea to decode_modrm. Together, all these optimizations save about 50 cycles on each emulated instructions (4-6%). Signed-off-by: Bandan Das <bsd@redhat.com> [Fix immediate operands with rip-relative addressing. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: emulate: rework seg_override	Bandan Das
	x86_decode_insn already sets a default for seg_override, so remove it from the zeroed area. Also replace set/get functions with direct access to the field. Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: emulate: clean up initializations in init_decode_cache	Bandan Das
	A lot of initializations are unnecessary as they get set to appropriate values before actually being used. Optimize placement of fields in x86_emulate_ctxt Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: emulate: cleanup decode_modrm	Bandan Das
	Remove the if conditional - that will help us avoid an "else initialize to 0" Also, rearrange operators for slightly better code. Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: emulate: Remove ctxt->intercept and ctxt->check_perm checks	Bandan Das
	The same information can be gleaned from ctxt->d and avoids having to zero/NULL initialize intercept and check_perm Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: emulate: move init_decode_cache to emulate.c	Bandan Das
	Core emulator functions all belong in emulator.c, x86 should have no knowledge of emulator internals Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: emulate: simplify writeback	Paolo Bonzini
	The "if/return" checks are useless, because we return X86EMUL_CONTINUE anyway if we do not return. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: emulate: speed up emulated moves	Paolo Bonzini
	We can just blindly move all 16 bytes of ctxt->src's value to ctxt->dst. write_register_operand will take care of writing only the lower bytes. Avoiding a call to memcpy (the compiler optimizes it out) gains about 200 cycles on kvm-unit-tests for register-to-register moves, and makes them about as fast as arithmetic instructions. We could perhaps get a larger speedup by moving all instructions _except_ moves out of x86_emulate_insn, removing opcode_len, and replacing the switch statement with an inlined em_mov. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: emulate: protect checks on ctxt->d by a common "if (unlikely())"	Paolo Bonzini
	There are several checks for "peculiar" aspects of instructions in both x86_decode_insn and x86_emulate_insn. Group them together, and guard them with a single "if" that lets the processor quickly skip them all. Make this more effective by adding two more flag bits that say whether the .intercept and .check_perm fields are valid. We will reuse these flags later to avoid initializing fields of the emulate_ctxt struct. This skims about 30 cycles for each emulated instructions, which is approximately a 3% improvement. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: emulate: move around some checks	Paolo Bonzini
	The only purpose of this patch is to make the next patch simpler to review. No semantic change. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation	Paolo Bonzini
	Despite the provisions to emulate up to 130 consecutive instructions, in practice KVM will emulate just one before exiting handle_invalid_guest_state, because x86_emulate_instruction always sets KVM_REQ_EVENT. However, we only need to do this if an interrupt could be injected, which happens a) if an interrupt shadow bit (STI or MOV SS) has gone away; b) if the interrupt flag has just been set (other instructions than STI can set it without enabling an interrupt shadow). This cuts another 700-900 cycles from the cost of emulating an instruction (measured on a Sandy Bridge Xeon: 1650-2600 cycles before the patch on kvm-unit-tests, 925-1700 afterwards). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: x86: return all bits from get_interrupt_shadow	Paolo Bonzini
	For the next patch we will need to know the full state of the interrupt shadow; we will then set KVM_REQ_EVENT when one bit is cleared. However, right now get_interrupt_shadow only returns the one corresponding to the emulated instruction, or an unconditional 0 if the emulated instruction does not have an interrupt shadow. This is confusing and does not allow us to check for cleared bits as mentioned above. Clean the callback up, and modify toggle_interruptibility to match the comment above the call. As a small result, the call to set_interrupt_shadow will be skipped in the common case where int_shadow == 0 && mask == 0. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: vmx: speed up emulation of invalid guest state	Paolo Bonzini
	About 25% of the time spent in emulation of invalid guest state is wasted in checking whether emulation is required for the next instruction. However, this almost never changes except when a segment register (or TR or LDTR) changes, or when there is a mode transition (i.e. CR0 changes). In fact, vmx_set_segment and vmx_set_cr0 already modify vmx->emulation_required (except that the former for some reason uses \|= instead of just an assignment). So there is no need to call guest_state_valid in the emulation loop. Emulation performance test results indicate 1650-2600 cycles for common instructions, versus 2300-3200 before this patch on a Sandy Bridge Xeon. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: svm: writes to MSR_K7_HWCR generates GPE in guest	Matthias Lange
	Since commit 575203 the MCE subsystem in the Linux kernel for AMD sets bit 18 in MSR_K7_HWCR. Running such a kernel as a guest in KVM on an AMD host results in a GPE injected into the guest because kvm_set_msr_common returns 1. This patch fixes this by masking bit 18 from the MSR value desired by the guest. Signed-off-by: Matthias Lange <matthias.lange@kernkonzept.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: x86: Pending interrupt may be delivered after INIT	Nadav Amit
	We encountered a scenario in which after an INIT is delivered, a pending interrupt is delivered, although it was sent before the INIT. As the SDM states in section 10.4.7.1, the ISR and the IRR should be cleared after INIT as KVM does. This also means that pending interrupts should be cleared. This patch clears upon reset (and INIT) the pending interrupts; and at the same occassion clears the pending exceptions, since they may cause a similar issue. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11	KVM: Synthesize G bit for all segments.	Jim Mattson
	We have noticed that qemu-kvm hangs early in the BIOS when runnning nested under some versions of VMware ESXi. The problem we believe is because KVM assumes that the platform preserves the 'G' but for any segment register. The SVM specification itemizes the segment attribute bits that are observed by the CPU, but the (G)ranularity bit is not one of the bits itemized, for any segment. Though current AMD CPUs keep track of the (G)ranularity bit for all segment registers other than CS, the specification does not require it. VMware's virtual CPU may not track the (G)ranularity bit for any segment register. Since kvm already synthesizes the (G)ranularity bit for the CS segment. It should do so for all segments. The patch below does that, and helps get rid of the hangs. Patch applies on top of Linus' tree. Signed-off-by: Jim Mattson <jmattson@vmware.com> Signed-off-by: Alok N Kataria <akataria@vmware.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-10	x86-32, vdso: Fix vDSO build error due to missing align_vdso_addr()	Jan Beulich
	Relying on static functions used just once to get inlined (and subsequently have dead code paths eliminated) is wrong: Compilers are free to decide whether they do this, regardless of optimization level. With this not happening for vdso_addr() (observed with gcc 4.1.x), an unresolved reference to align_vdso_addr() causes the build to fail. [ hpa: vdso_addr() is never actually used on x86-32, as calculate_addr in map_vdso() is always false. It ought to be possible to clean this up further, but this fixes the immediate problem. ] Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/53B5863B02000078000204D5@mail.emea.novell.com Acked-by: Andy Lutomirski <luto@amacapital.net> Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-10	x86-64, vdso: Fix vDSO build breakage due to empty .rela.dyn	Jan Beulich
	Certain ld versions (observed with 2.20.0) put an empty .rela.dyn section into shared object files, breaking the assumption on the number of sections to be copied to the final output. Simply discard any empty SHT_REL and SHT_RELA sections to address this. Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/53B5861E02000078000204D1@mail.emea.novell.com Acked-by: Andy Lutomirski <luto@amacapital.net> Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-10	x86, ia64: Move EFI_FB vga_default_device() initialization to pci_vga_fixup()	Bruno Prémont
	Commit b4aa0163056b ("efifb: Implement vga_default_device() (v2)") added efifb vga_default_device() so EFI systems that do not load shadow VBIOS or setup VGA get proper value for boot_vga PCI sysfs attribute on the corresponding PCI device. Xorg doesn't detect devices when boot_vga=0, e.g., on some EFI systems such as MacBookAir2,1. Xorg detects the GPU and finds the DRI device but then bails out with "no devices detected". Note: When vga_default_device() is set boot_vga PCI sysfs attribute reflects its state. When unset this attribute is 1 whenever IORESOURCE_ROM_SHADOW flag is set. With introduction of sysfb/simplefb/simpledrm efifb is getting obsolete while having native drivers for the GPU also makes selecting sysfb/efifb optional. Remove the efifb implementation of vga_default_device() and initialize vgaarb's vga_default_device() with the PCI GPU that matches boot screen_info in pci_fixup_video(). [bhelgaas: remove unused "dev" in efifb_setup()] Fixes: b4aa0163056b ("efifb: Implement vga_default_device() (v2)") Tested-by: Anibal Francisco Martinez Cortina <linuxkid.zeuz@gmail.com> Signed-off-by: Bruno Prémont <bonbons@linux-vserver.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Matthew Garrett <matthew.garrett@nebula.com> CC: stable@vger.kernel.org # v3.5+
2014-07-10	Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6	Linus Torvalds
	Pull crypto fixes from Herbert Xu: "This push fixes an error in sha512_ssse3 that leads to incorrect output as well as a memory leak in caam_jr when the module is unloaded" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: caam - fix memleak in caam_jr module crypto: sha512_ssse3 - fix byte count to bit count conversion
2014-07-10	x86/efi: Include a .bss section within the PE/COFF headers	Michael Brown
	The PE/COFF headers currently describe only the initialised-data portions of the image, and result in no space being allocated for the uninitialised-data portions. Consequently, the EFI boot stub will end up overwriting unexpected areas of memory, with unpredictable results. Fix by including a .bss section in the PE/COFF headers (functionally equivalent to the init_size field in the bzImage header). Signed-off-by: Michael Brown <mbrown@fensystems.co.uk> Cc: Thomas Bächler <thomas@archlinux.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: <stable@vger.kernel.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-07-09	KVM: x86: Fix lapic.c debug prints	Nadav Amit
	In two cases lapic.c does not use the apic_debug macro correctly. This patch fixes them. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-09	KVM: x86: fix TSC matching	Tomasz Grabiec
	I've observed kvmclock being marked as unstable on a modern single-socket system with a stable TSC and qemu-1.6.2 or qemu-2.0.0. The culprit was failure in TSC matching because of overflow of kvm_arch::nr_vcpus_matched_tsc in case there were multiple TSC writes in a single synchronization cycle. Turns out that qemu does multiple TSC writes during init, below is the evidence of that (qemu-2.0.0): The first one: 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel] 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm] 0xffffffffa04cfd6b : kvm_arch_vcpu_postcreate+0x4b/0x80 [kvm] 0xffffffffa04b8188 : kvm_vm_ioctl+0x418/0x750 [kvm] The second one: 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel] 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm] 0xffffffffa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel] 0xffffffffa04be83b : do_set_msr+0x3b/0x60 [kvm] 0xffffffffa04c10a8 : msr_io+0xc8/0x160 [kvm] 0xffffffffa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm] 0xffffffffa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm] #0 kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780 #1 kvm_put_msrs at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270 #2 kvm_arch_put_registers at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909 #3 kvm_cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1641 #4 cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:330 #5 cpu_synchronize_all_post_init () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:521 #6 main at /build/buildd/qemu-2.0.0+dfsg/vl.c:4390 The third one: 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel] 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm] 0xffffffffa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel] 0xffffffffa04be83b : do_set_msr+0x3b/0x60 [kvm] 0xffffffffa04c10a8 : msr_io+0xc8/0x160 [kvm] 0xffffffffa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm] 0xffffffffa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm] #0 kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780 #1 kvm_put_msrs at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270 #2 kvm_arch_put_registers at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909 #3 kvm_cpu_synchronize_post_reset at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1635 #4 cpu_synchronize_post_reset at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:323 #5 cpu_synchronize_all_post_reset () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:512 #6 main at /build/buildd/qemu-2.0.0+dfsg/vl.c:4482 The fix is to count each vCPU only once when matched, so that nr_vcpus_matched_tsc holds the size of the matched set. This is achieved by reusing generation counters. Every vCPU with this_tsc_generation == cur_tsc_generation is in the matched set. The match set is cleared by setting cur_tsc_generation to a value which no other vCPU is set to (by incrementing it). I needed to bump up the counter size form u8 to u64 to ensure it never overflows. Otherwise in cases TSC is not written the same number of times on each vCPU the counter could overflow and incorrectly indicate some vCPUs as being in the matched set. This scenario seems unlikely but I'm not sure if it can be disregarded. Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-09	KVM: nSVM: Set correct port for IOIO interception evaluation	Jan Kiszka
	Obtaining the port number from DX is bogus as a) there are immediate port accesses and b) user space may have changed the register content while processing the PIO access. Forward the correct value from the instruction emulator instead. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-09	KVM: nSVM: Fix IOIO size reported on emulation	Jan Kiszka
	The access size of an in/ins is reported in dst_bytes, and that of out/outs in src_bytes. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-09	KVM: nSVM: Fix IOIO bitmap evaluation	Jan Kiszka
	First, kvm_read_guest returns 0 on success. And then we need to take the access size into account when testing the bitmap: intercept if any of bits corresponding to the access is set. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-09	KVM: nSVM: Do not report CLTS via SVM_EXIT_WRITE_CR0 to L1	Jan Kiszka
	CLTS only changes TS which is not monitored by selected CR0 interception. So skip any attempt to translate WRITE_CR0 to CR0_SEL_WRITE for this instruction. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-08	KVM: x86: Check for nested events if there is an injectable interrupt	Bandan Das
	With commit b6b8a1451fc40412c57d1 that introduced vmx_check_nested_events, checks for injectable interrupts happen at different points in time for L1 and L2 that could potentially cause a race. The regression occurs because KVM_REQ_EVENT is always set when nested_run_pending is set even if there's no pending interrupt. Consequently, there could be a small window when check_nested_events returns without exiting to L1, but an interrupt comes through soon after and it incorrectly, gets injected to L2 by inject_pending_event Fix this by adding a call to check for nested events too when a check for injectable interrupt returns true Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-07	efi: efistub: Refactor stub components	Ard Biesheuvel
	In order to move from the #include "../../../xxxxx.c" anti-pattern used by both the x86 and arm64 versions of the stub to a static library linked into either the kernel proper (arm64) or a separate boot executable (x86), there is some prepatory work required. This patch does the following: - move forward declarations of functions shared between the arch specific and the generic parts of the stub to include/linux/efi.h - move forward declarations of functions shared between various .c files of the generic stub code to a new local header file called "efistub.h" - add #includes to all .c files which were formerly relying on the #includor to include the correct header files - remove all static modifiers from functions which will need to be externally visible once we move to a static library Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-07-07	efi/x86: efistub: Move shared dependencies to <asm/efi.h>	Ard Biesheuvel
	This moves definitions depended upon both by code under arch/x86/boot and under drivers/firmware/efi to <asm/efi.h>. This is in preparation of turning the stub code under drivers/firmware/efi into a static library. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-07-07	efi/x86: Move UEFI Runtime Services wrappers to generic code	Ard Biesheuvel
	In order for other archs (such as arm64) to be able to reuse the virtual mode function call wrappers, move them to drivers/firmware/efi/runtime-wrappers.c. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-07-05	perf/x86: Micro-optimize nhmex_rbox_get_constraint()	Rasmus Villemoes
	Flipping the LSB doesn't require four lines of code. This shaves a few bytes of the generated code, including a branch. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403183731-15402-1-git-send-email-linux@rasmusvillemoes.dk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-03	ptrace,x86: force IRET path after a ptrace_stop()	Tejun Heo
	The 'sysret' fastpath does not correctly restore even all regular registers, much less any segment registers or reflags values. That is very much part of why it's faster than 'iret'. Normally that isn't a problem, because the normal ptrace() interface catches the process using the signal handler infrastructure, which always returns with an iret. However, some paths can get caught using ptrace_event() instead of the signal path, and for those we need to make sure that we aren't going to return to user space using 'sysret'. Otherwise the modifications that may have been done to the register set by the tracer wouldn't necessarily take effect. Fix it by forcing IRET path by setting TIF_NOTIFY_RESUME from arch_ptrace_stop_needed() which is invoked from ptrace_stop(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-02	perf/x86/intel: ignore CondChgd bit to avoid false NMI handling	HATAYAMA Daisuke
	Currently, any NMI is falsely handled by a NMI handler of NMI watchdog if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set. For example, we use external NMI to make system panic to get crash dump, but in this case, the external NMI is falsely handled do to the issue. This commit deals with the issue simply by ignoring CondChgd bit. Here is explanation in detail. On x86 NMI watchdog uses performance monitoring feature to periodically signal NMI each time performance counter gets overflowed. intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI handler of NMI watchdog, perf_event_nmi_handler(). It identifies an owner of a given NMI by looking at overflow status bits in MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it handles the given NMI as its own NMI. The problem is that the intel_pmu_handle_irq() doesn't distinguish CondChgd bit from other bits. Unlike the other status bits, CondChgd bit doesn't represent overflow status for performance counters. Thus, CondChgd bit cannot be thought of as a mark indicating a given NMI is NMI watchdog's. As a result, if CondChgd bit is set, any NMI is falsely handled by the NMI handler of NMI watchdog. Also, if type of the falsely handled NMI is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding action is never performed until CondChgd bit is cleared. I noticed this behavior on systems with Ivy Bridge processors: Intel Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems, CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set in the beginning at boot. Then the CondChgd bit is immediately cleared by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0. On the other hand, on older processors such as Nehalem, Xeon E7540, CondChgd bit is not set in the beginning at boot. I'm not sure about exact behavior of CondChgd bit, in particular when this bit is set. Although I read Intel System Programmer's Manual to figure out that, the descriptions I found are: In 18.9.1: "The MSR_PERF_GLOBAL_STATUS MSR also provides a ¡sticky bit¢ to indicate changes to the state of performancmonitoring hardware" In Table 35-2 IA-32 Architectural MSRs 63 CondChg: status bits of this register has changed. These are different from the bahviour I see on the actual system as I explained above. At least, I think ignoring CondChgd bit should be enough for NMI watchdog perspective. Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20140625.103503.409316067.d.hatayama@jp.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-02	x86, tsc: Fix cpufreq lockup	Peter Zijlstra
	Mauro reported that his AMD X2 using the powernow-k8 cpufreq driver locked up when doing cpu hotplug. Because we called set_cyc2ns_scale() from the time_cpufreq_notifier() unconditionally, it gets called multiple times for each freq change, instead of only the once, when the tsc_khz value actually changes. Because it gets called more than once, we run out of cyc2ns data slots and stall, waiting for a free one, but because we're half way offline, there's no consumers to free slots. By placing the call inside the condition that actually changes tsc_khz we avoid superfluous calls and avoid the problem. Reported-by: Mauro <registosites@hotmail.com> Tested-by: Mauro <registosites@hotmail.com> Fixes: 20d1c86a5776 ("sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs") Cc: <stable@vger.kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Bin Gao <bin.gao@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mika Westerberg <mika.westerberg@linux.intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Stefani Seibold <stefani@seibold.net> Cc: linux-kernel@vger.kernel.org Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-01	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm	Linus Torvalds
	Pull KVM fixes from Paolo Bonzini: "A bunch of one-liners (except the s390 one). The two more serious bugs ("KVM: SVM: Fix CPL export via SS.DPL" and "KVM: s390: add sie.h uapi header file to Kbuild and remove header dependency") were introduced in the 3.16 merge window" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: SVM: Fix CPL export via SS.DPL KVM: s390: add sie.h uapi header file to Kbuild and remove header dependency MIPS: KVM: Fix memory leak on VCPU KVM: x86: preserve the high 32-bits of the PAT register kvm: fix wrong address when writing Hyper-V tsc page KVM: x86: Increase the number of fixed MTRR regs to 10
2014-07-01	tracing: Add trace_seq_buffer_ptr() helper function	Steven Rostedt (Red Hat)
	There's several locations in the kernel that open code the calculation of the next location in the trace_seq buffer. This is usually done with p->buffer + p->len Instead of having this open coded, supply a helper function in the header to do it for them. This function is called trace_seq_buffer_ptr(). Link: http://lkml.kernel.org/p/20140626220129.452783019@goodmis.org Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-01	ftrace: Optimize function graph to be called directly	Steven Rostedt (Red Hat)
	Function graph tracing is a bit different than the function tracers, as it is processed after either the ftrace_caller or ftrace_regs_caller and we only have one place to modify the jump to ftrace_graph_caller, the jump needs to happen after the restore of registeres. The function graph tracer is dependent on the function tracer, where even if the function graph tracing is going on by itself, the save and restore of registers is still done for function tracing regardless of if function tracing is happening, before it calls the function graph code. If there's no function tracing happening, it is possible to just call the function graph tracer directly, and avoid the wasted effort to save and restore regs for function tracing. This requires adding new flags to the dyn_ftrace records: FTRACE_FL_TRAMP FTRACE_FL_TRAMP_EN The first is set if the count for the record is one, and the ftrace_ops associated to that record has its own trampoline. That way the mcount code can call that trampoline directly. In the future, trampolines can be added to arbitrary ftrace_ops, where you can have two or more ftrace_ops registered to ftrace (like kprobes and perf) and if they are not tracing the same functions, then instead of doing a loop to check all registered ftrace_ops against their hashes, just call the ftrace_ops trampoline directly, which would call the registered ftrace_ops function directly. Without this patch perf showed: 0.05% hackbench [kernel.kallsyms] [k] ftrace_caller 0.05% hackbench [kernel.kallsyms] [k] arch_local_irq_save 0.05% hackbench [kernel.kallsyms] [k] native_sched_clock 0.04% hackbench [kernel.kallsyms] [k] __buffer_unlock_commit 0.04% hackbench [kernel.kallsyms] [k] preempt_trace 0.04% hackbench [kernel.kallsyms] [k] prepare_ftrace_return 0.04% hackbench [kernel.kallsyms] [k] __this_cpu_preempt_check 0.04% hackbench [kernel.kallsyms] [k] ftrace_graph_caller See that the ftrace_caller took up more time than the ftrace_graph_caller did. With this patch: 0.05% hackbench [kernel.kallsyms] [k] __buffer_unlock_commit 0.04% hackbench [kernel.kallsyms] [k] call_filter_check_discard 0.04% hackbench [kernel.kallsyms] [k] ftrace_graph_caller 0.04% hackbench [kernel.kallsyms] [k] sched_clock The ftrace_caller is no where to be found and ftrace_graph_caller still takes up the same percentage. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-30	arch: x86: kvm: x86.c: Cleaning up variable is set more than once	Rickard Strandqvist
	A struct member variable is set to the same value more than once This was found using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-30	Merge commit '33b458d276bb' into kvm-next	Paolo Bonzini
	Fix bad x86 regression introduced during merge window.