Age | Commit message (Collapse) | Author |
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Commit:
3cded4179481 ("x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()")
introduced a paravirt op with bool return type [*]
It turns out that the PVOP_CALL*() macros miscompile when rettype is
bool. Code that looked like:
83 ef 01 sub $0x1,%edi
ff 15 32 a0 d8 00 callq *0xd8a032(%rip) # ffffffff81e28120 <pv_lock_ops+0x20>
84 c0 test %al,%al
ended up looking like so after PVOP_CALL1() was applied:
83 ef 01 sub $0x1,%edi
48 63 ff movslq %edi,%rdi
ff 14 25 20 81 e2 81 callq *0xffffffff81e28120
48 85 c0 test %rax,%rax
Note how it tests the whole of %rax, even though a typical bool return
function only sets %al, like:
0f 95 c0 setne %al
c3 retq
This is because ____PVOP_CALL() does:
__ret = (rettype)__eax;
and while regular integer type casts truncate the result, a cast to
bool tests for any !0 value. Fix this by explicitly truncating to
sizeof(rettype) before casting.
[*] The actual bug should've been exposed in commit:
446f3dc8cc0a ("locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests")
but that didn't properly implement the paravirt call.
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 3cded4179481 ("x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()")
Link: http://lkml.kernel.org/r/20161208154349.346057680@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
While chasing a regression I noticed we potentially patch the wrong
code in native_patch().
If we do not select the native code sequence, we must use the default
patcher, not fall-through the switch case.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel test robot <xiaolong.ye@intel.com>
Fixes: 3cded4179481 ("x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()")
Link: http://lkml.kernel.org/r/20161208154349.270616999@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
An earlier patch allowed enabling PT and LBR at the same
time on Goldmont. However it also allowed enabling BTS and LBR
at the same time, which is still not supported. Fix this by
bypassing the check only for PT.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: alexander.shishkin@intel.com
Cc: kan.liang@intel.com
Cc: <stable@vger.kernel.org>
Fixes: ccbebba4c6bf ("perf/x86/intel/pt: Bypass PT vs. LBR exclusivity if the core supports it")
Link: http://lkml.kernel.org/r/20161209001417.4713-1-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
|
|
ldt->size can never be negative. The helper functions take 'unsigned int'
arguments which are assigned from ldt->size. The related user space
user_desc struct member entry_number is unsigned as well.
But ldt->size itself and a few local variables which are related to
ldt->size are type 'int' which makes no sense whatsoever and results in
typecasts which make the eyes bleed.
Clean it up and convert everything which is related to ldt->size to
unsigned it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
|
|
My static checker complains that we put an upper bound on the "size"
argument but not a lower bound. The checker is not smart enough to know
the possible ranges of "old_mm->context.ldt->size" from
init_new_context_ldt() so it thinks maybe it could be negative.
Let's make it unsigned to silence the warning and future proof the code
a bit.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: kernel-janitors@vger.kernel.org
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20161208105602.GA11382@elgon.mountain
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
One include less is always a good thing(tm). Good riddance.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20161209182912.2726-6-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Reorganize the E400 detection now that we have everything in place:
switch the CPUs to broadcast mode after the LAPIC has been initialized
and remove the facilities that were used previously on the idle path.
Unfortunately static_cpu_has_bug() cannpt be used in the E400 idle routine
because alternatives have been applied when the actual detection happens,
so the static switching does not take effect and the test will stay
false. Use boot_cpu_has_bug() instead which is definitely an improvement
over the RDMSR and the cpumask handling.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20161209182912.2726-5-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
AMD CPUs affected by the E400 erratum suffer from the issue that the
local APIC timer stops when the CPU goes into C1E. Unfortunately there
is no way to detect the affected CPUs on early boot. It's only possible
to determine the range of possibly affected CPUs from the family/model
range.
The actual decision whether to enter C1E and thus cause the bug is done
by the firmware and we need to detect that case late, after ACPI has
been initialized.
The current solution is to check in the idle routine whether the CPU is
affected by reading the MSR_K8_INT_PENDING_MSG MSR and checking for the
K8_INTP_C1E_ACTIVE_MASK bits. If one of the bits is set then the CPU is
affected and the system is switched into forced broadcast mode.
This is ineffective and on non-affected CPUs every entry to idle does
the extra RDMSR.
After doing some research it turns out that the bits are visible on the
boot CPU right after the ACPI subsystem is initialized in the early
boot process. So instead of polling for the bits in the idle loop, add
a detection function after acpi_subsystem_init() and check for the MSR
bits. If set, then the X86_BUG_AMD_APIC_C1E is set on the boot CPU and
the TSC is marked unstable when X86_FEATURE_NONSTOP_TSC is not set as it
will stop in C1E state as well.
The switch to broadcast mode cannot be done at this point because the
boot CPU still uses HPET as a clockevent device and the local APIC timer
is not yet calibrated and installed. The switch to broadcast mode on the
affected CPUs needs to be done when the local APIC timer is actually set
up.
This allows to cleanup the amd_e400_idle() function in the next step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20161209182912.2726-4-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
The workaround for the AMD Erratum E400 (Local APIC timer stops in C1E
state) is a two step process:
- Selection of the E400 aware idle routine
- Detection whether the platform is affected
The idle routine selection happens for possibly affected CPUs depending on
family/model/stepping information. These range of CPUs is not necessarily
affected as the decision whether to enable the C1E feature is made by the
firmware. Unfortunately there is no way to query this at early boot.
The current implementation polls a MSR in the E400 aware idle routine to
detect whether the CPU is affected. This is inefficient on non affected
CPUs because every idle entry has to do the MSR read.
There is a better way to detect this before going idle for the first time
which requires to seperate the bug flags:
X86_BUG_AMD_E400 - Selects the E400 aware idle routine and
enables the detection
X86_BUG_AMD_APIC_C1E - Set when the platform is affected by E400
Replace the current X86_BUG_AMD_APIC_C1E usage by the new X86_BUG_AMD_E400
bug bit to select the idle routine which currently does an unconditional
detection poll. X86_BUG_AMD_APIC_C1E is going to be used in later patches
to remove the MSR polling and simplify the handling of this misfeature.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20161209182912.2726-3-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Will be used in a later patch to set bug bits for bugs which need late
detection.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20161209182912.2726-2-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
to it
With new binutils, gcc may get smart with its optimization and change a jmp
from a 5 byte jump to a 2 byte one even though it was jumping to a global
function. But that global function existed within a 2 byte radius, and gcc
was able to optimize it. Unfortunately, that jump was also being modified
when function graph tracing begins. Since ftrace expected that jump to be 5
bytes, but it was only two, it overwrote code after the jump, causing a
crash.
This was fixed for x86_64 with commit 8329e818f149, with the same subject as
this commit, but nothing was done for x86_32.
Cc: stable@vger.kernel.org
Fixes: d61f82d06672 ("ftrace: use dynamic patching for updating mcount calls")
Reported-by: Colin Ian King <colin.king@canonical.com>
Tested-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
Some tracepoints have a registration function that gets enabled when the
tracepoint is enabled. There may be cases that the registraction function
must fail (for example, can't allocate enough memory). In this case, the
tracepoint should also fail to register, otherwise the user would not know
why the tracepoint is not working.
Cc: David Howells <dhowells@redhat.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
Implement show_options() callback for intel resource control filesystem
to expose the active mount options in /proc/
Signed-off-by: Shaohua Li <shli@fb.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/7dce7c1886ac9289442d254ea18322c92bd968da.1480717072.git.shli@fb.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
On systems with sufficiently large e820 tables, and several IOAPICs, it
is possible for the XENMEM_machine_memory_map callback (and its
counterpart, XENMEM_memory_map) to attempt to return an e820 table with
more than 128 entries. This callback adds entries to the BIOS-provided
e820 table to account for IOAPIC registers, which, on sufficiently large
systems, can result in an e820 table that is too large to copy back into
xen_e820_map.
This change simply increases the size of xen_e820_map to E820_X_MAX to
ensure that there is enough room to store the entire e820 map returned
from this callback.
Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
It's really not necessary to limit E820_X_MAX to 128 in the non-EFI
case. This commit drops E820_X_MAX's dependency on CONFIG_EFI, so that
E820_X_MAX is always at least slightly larger than E820MAX.
The real motivation behind this is actually to prevent some issues in
the Xen kernel, where the XENMEM_machine_memory_map hypercall can
produce an e820 map larger than 128 entries, even on systems where the
original e820 table was quite a bit smaller than that, depending on how
many IOAPICs are installed on the system.
Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Ingo Molnar <mingo@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
This patch allows XDP prog to extend/remove the packet
data at the head (like adding or removing header). It is
done by adding a new XDP helper bpf_xdp_adjust_head().
It also renames bpf_helper_changes_skb_data() to
bpf_helper_changes_pkt_data() to better reflect
that XDP prog does not work on skb.
This patch adds one "xdp_adjust_head" bit to bpf_prog for the
XDP-capable driver to check if the XDP prog requires
bpf_xdp_adjust_head() support. The driver can then decide
to error out during XDP_SETUP_PROG.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use the new API to create and destroy the "kvm-pit" kthread
worker. The API hides some implementation details.
In particular, kthread_create_worker() allocates and initializes
struct kthread_worker. It runs the kthread the right way
and stores task_struct into the worker structure.
kthread_destroy_worker() flushes all pending works, stops
the kthread and frees the structure.
This patch does not change the existing behavior except for
dynamically allocating struct kthread_worker and storing
only the pointer of this structure.
It is compile tested only because I did not find an easy
way how to run the code. Well, it should be pretty safe
given the nature of the change.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Message-Id: <1476877847-11217-1-git-send-email-pmladek@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
- Expose all invalidation types to the L1
- Reject invvpid instruction, if L1 passed zero vpid value to single
context invalidations
Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
This commit adds missing host CR3 checks. Before entering guest mode, the value
of CR3 is checked for reserved bits. After returning, nested_vmx_load_cr3 is
called to set the new CR3 value and check and load PDPTRs.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
as implemented in kvm_set_cr3, in several ways.
* different rules are followed to check CR3 and it is desirable for the caller
to distinguish between the possible failures
* PDPTRs are not loaded if PAE paging and nested EPT are both enabled
* many MMU operations are not necessary
This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
nested vmentry and vmexit, and makes use of it on the nested vmentry path.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
It is possible that prepare_vmcs02 fails to load the guest state. This
patch adds the proper error handling for such a case. L1 will receive
an INVALID_STATE vmexit with the appropriate exit qualification if it
happens.
A failure to set guest CR3 is the only error propagated from prepare_vmcs02
at the moment.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates guest
PDPTE VMCS fields and leaves guest CR3 uninitialized because it is not used
(see 26.3.2.4 Loading Page-Directory-Pointer-Table Entries). KVM always
dereferences CR3 and tries to load PDPTEs if PAE is on. This leads to two
related issues:
1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
overwritten in ept_load_pdptrs because the registers are believed to have
been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
running with PAE enabled but PDPTRs have been set up by L1.
2) When L2 is about to enable paging and loads its CR3, we, again, attempt
to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no guarantees
that this will succeed (it's just a CR3 load, paging is not enabled yet) and
if it doesn't, kvm_set_cr3 returns early without persisting the CR3 which is
then lost and L2 crashes right after it enables paging.
This patch replaces the kvm_set_cr3 call with a simple register write if PAE
and EPT are both on. CR3 is not to be interpreted in this case.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
vmx_set_cr0() modifies GUEST_EFER and "IA-32e mode guest" in the current
VMCS. Call vmx_set_efer() after vmx_set_cr0() so that emulated VM-entry
is more faithful to VMCS12.
This patch correctly causes VM-entry to fail when "IA-32e mode guest" is
1 and GUEST_CR0.PG is 0. Previously this configuration would succeed and
"IA-32e mode guest" would silently be disabled by KVM.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
MSR_IA32_CR{0,4}_FIXED1 define which bits in CR0 and CR4 are allowed to
be 1 during VMX operation. Since the set of allowed-1 bits is the same
in and out of VMX operation, we can generate these MSRs entirely from
the guest's CPUID. This lets userspace avoiding having to save/restore
these MSRs.
This patch also initializes MSR_IA32_CR{0,4}_FIXED1 from the CPU's MSRs
by default. This is a saner than the current default of -1ull, which
includes bits that the host CPU does not support.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
KVM emulates MSR_IA32_VMX_CR{0,4}_FIXED1 with the value -1ULL, meaning
all CR0 and CR4 bits are allowed to be 1 during VMX operation.
This does not match real hardware, which disallows the high 32 bits of
CR0 to be 1, and disallows reserved bits of CR4 to be 1 (including bits
which are defined in the SDM but missing according to CPUID). A guest
can induce a VM-entry failure by setting these bits in GUEST_CR0 and
GUEST_CR4, despite MSR_IA32_VMX_CR{0,4}_FIXED1 indicating they are
valid.
Since KVM has allowed all bits to be 1 in CR0 and CR4, the existing
checks on these registers do not verify must-be-0 bits. Fix these checks
to identify must-be-0 bits according to MSR_IA32_VMX_CR{0,4}_FIXED1.
This patch should introduce no change in behavior in KVM, since these
MSRs are still -1ULL.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
The VMX capability MSRs advertise the set of features the KVM virtual
CPU can support. This set of features varies across different host CPUs
and KVM versions. This patch aims to addresses both sources of
differences, allowing VMs to be migrated across CPUs and KVM versions
without guest-visible changes to these MSRs. Note that cross-KVM-
version migration is only supported from this point forward.
When the VMX capability MSRs are restored, they are audited to check
that the set of features advertised are a subset of what KVM and the
CPU support.
Since the VMX capability MSRs are read-only, they do not need to be on
the default MSR save/restore lists. The userspace hypervisor can set
the values of these MSRs or read them from KVM at VCPU creation time,
and restore the same value after every save/restore.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
The "non-true" VMX capability MSRs can be generated from their "true"
counterparts, by OR-ing the default1 bits. The default1 bits are fixed
and defined in the SDM.
Since we can generate the non-true VMX MSRs from the true versions,
there's no need to store both in struct nested_vmx. This also lets
userspace avoid having to restore the non-true MSRs.
Note this does not preclude emulating MSR_IA32_VMX_BASIC[55]=0. To do so,
we simply need to set all the default1 bits in the true MSRs (such that
the true MSRs and the generated non-true MSRs are equal).
Signed-off-by: David Matlack <dmatlack@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
The trap flag stays set until software clears it.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
kvm_skip_emulated_instruction calls both
kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
skipping the emulated instruction and generating a trap if necessary.
Replacing skip_emulated_instruction calls with
kvm_skip_emulated_instruction is straightforward, except for:
- ICEBP, which is already inside a trap, so avoid triggering another trap.
- Instructions that can trigger exits to userspace, such as the IO insns,
MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
take precedence. The singlestep will be triggered again on the next
instruction, which is the current behavior.
- Task switch instructions which would require additional handling (e.g.
the task switch bit) and are instead left alone.
- Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
which do not trigger singlestep traps as mentioned previously.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
We can't return both the pass/fail boolean for the vmcs and the upcoming
continue/exit-to-userspace boolean for skip_emulated_instruction out of
nested_vmx_check_vmcs, so move skip_emulated_instruction out of it instead.
Additionally, VMENTER/VMRESUME only trigger singlestep exceptions when
they advance the IP to the following instruction, not when they a) succeed,
b) fail MSR validation or c) throw an exception. Add a separate call to
skip_emulated_instruction that will later not be converted to the variant
that checks the singlestep flag.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
The functions being moved ahead of skip_emulated_instruction here don't
need updated IPs, and skipping the emulated instruction at the end will
make it easier to return its value.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Once skipping the emulated instruction can potentially trigger an exit to
userspace (via KVM_GUESTDBG_SINGLESTEP) kvm_emulate_cpuid will need to
propagate a return value.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
The function is never called under PV guests, and only shows up
when MSI (or MSI-X) cannot be allocated. Convert the message
to include the error value.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes: a core dumping crash fix, a guess-unwinder regression fix,
plus three build warning fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind: Fix guess-unwinder regression
x86/build: Annotate die() with noreturn to fix build warning on clang
x86/platform/olpc: Fix resume handler build warning
x86/apic/uv: Silence a shift wrapping warning
x86/coredump: Always use user_regs_struct for compat_elf_gregset_t
|
|
I recently encountered wreckage because access_ok() was used where it
should not be, add an explicit WARN when access_ok() is used wrongly.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Lukasz reported that perf stat counters overflow handling is broken on KNL/SLM.
Both these parts have full_width_write set, and that does indeed have
a problem. In order to deal with counter wrap, we must sample the
counter at at least half the counter period (see also the sampling
theorem) such that we can unambiguously reconstruct the count.
However commit:
069e0c3c4058 ("perf/x86/intel: Support full width counting")
sets the sampling interval to the full period, not half.
Fixing that exposes another issue, in that we must not sign extend the
delta value when we shift it right; the counter cannot have
decremented after all.
With both these issues fixed, counter overflow functions correctly
again.
Reported-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
Tested-by: Liang, Kan <kan.liang@intel.com>
Tested-by: Odzioba, Lukasz <lukasz.odzioba@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: stable@vger.kernel.org
Fixes: 069e0c3c4058 ("perf/x86/intel: Support full width counting")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The Knights Mill is enough close to Knights Landing so the path reuses
C-state residency support of the latter.
Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20161201000853.18260-1-piotr.luc@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Resuming from a suspend operation is showing a KASAN false positive
warning:
BUG: KASAN: stack-out-of-bounds in unwind_get_return_address+0x11d/0x130 at addr ffff8803867d7878
Read of size 8 by task pm-suspend/7774
page:ffffea000e19f5c0 count:0 mapcount:0 mapping: (null) index:0x0
flags: 0x2ffff0000000000()
page dumped because: kasan: bad access detected
CPU: 0 PID: 7774 Comm: pm-suspend Tainted: G B 4.9.0-rc7+ #8
Hardware name: Gigabyte Technology Co., Ltd. Z170X-UD5/Z170X-UD5-CF, BIOS F5 03/07/2016
Call Trace:
dump_stack+0x63/0x82
kasan_report_error+0x4b4/0x4e0
? acpi_hw_read_port+0xd0/0x1ea
? kfree_const+0x22/0x30
? acpi_hw_validate_io_request+0x1a6/0x1a6
__asan_report_load8_noabort+0x61/0x70
? unwind_get_return_address+0x11d/0x130
unwind_get_return_address+0x11d/0x130
? unwind_next_frame+0x97/0xf0
__save_stack_trace+0x92/0x100
save_stack_trace+0x1b/0x20
save_stack+0x46/0xd0
? save_stack_trace+0x1b/0x20
? save_stack+0x46/0xd0
? kasan_kmalloc+0xad/0xe0
? kasan_slab_alloc+0x12/0x20
? acpi_hw_read+0x2b6/0x3aa
? acpi_hw_validate_register+0x20b/0x20b
? acpi_hw_write_port+0x72/0xc7
? acpi_hw_write+0x11f/0x15f
? acpi_hw_read_multiple+0x19f/0x19f
? memcpy+0x45/0x50
? acpi_hw_write_port+0x72/0xc7
? acpi_hw_write+0x11f/0x15f
? acpi_hw_read_multiple+0x19f/0x19f
? kasan_unpoison_shadow+0x36/0x50
kasan_kmalloc+0xad/0xe0
kasan_slab_alloc+0x12/0x20
kmem_cache_alloc_trace+0xbc/0x1e0
? acpi_get_sleep_type_data+0x9a/0x578
acpi_get_sleep_type_data+0x9a/0x578
acpi_hw_legacy_wake_prep+0x88/0x22c
? acpi_hw_legacy_sleep+0x3c7/0x3c7
? acpi_write_bit_register+0x28d/0x2d3
? acpi_read_bit_register+0x19b/0x19b
acpi_hw_sleep_dispatch+0xb5/0xba
acpi_leave_sleep_state_prep+0x17/0x19
acpi_suspend_enter+0x154/0x1e0
? trace_suspend_resume+0xe8/0xe8
suspend_devices_and_enter+0xb09/0xdb0
? printk+0xa8/0xd8
? arch_suspend_enable_irqs+0x20/0x20
? try_to_freeze_tasks+0x295/0x600
pm_suspend+0x6c9/0x780
? finish_wait+0x1f0/0x1f0
? suspend_devices_and_enter+0xdb0/0xdb0
state_store+0xa2/0x120
? kobj_attr_show+0x60/0x60
kobj_attr_store+0x36/0x70
sysfs_kf_write+0x131/0x200
kernfs_fop_write+0x295/0x3f0
__vfs_write+0xef/0x760
? handle_mm_fault+0x1346/0x35e0
? do_iter_readv_writev+0x660/0x660
? __pmd_alloc+0x310/0x310
? do_lock_file_wait+0x1e0/0x1e0
? apparmor_file_permission+0x18/0x20
? security_file_permission+0x73/0x1c0
? rw_verify_area+0xbd/0x2b0
vfs_write+0x149/0x4a0
SyS_write+0xd9/0x1c0
? SyS_read+0x1c0/0x1c0
entry_SYSCALL_64_fastpath+0x1e/0xad
Memory state around the buggy address:
ffff8803867d7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff8803867d7780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8803867d7800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f4
^
ffff8803867d7880: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
ffff8803867d7900: 00 00 00 f1 f1 f1 f1 04 f4 f4 f4 f3 f3 f3 f3 00
KASAN instrumentation poisons the stack when entering a function and
unpoisons it when exiting the function. However, in the suspend path,
some functions never return, so their stack never gets unpoisoned,
resulting in stale KASAN shadow data which can cause later false
positive warnings like the one above.
Reported-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Linux 4.9-rc8
Daniel requested this so we could apply some follow on fixes cleanly to -next.
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
intel_rdt_sched_in() must be called with preemption disabled because the
function accesses percpu variables (pqr_state and closid).
If a task moves itself via move_myself() preemption is enabled, which
violates the calling convention and can result in incorrect closid
selection when the task gets preempted or migrated.
Add the required protection and a comment about the calling convention.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Marcelo Tosatti" <mtosatti@redhat.com>
Cc: "Sai Prakhya" <sai.praneeth.prakhya@intel.com>
Cc: "Vikas Shivappa" <vikas.shivappa@linux.intel.com>
Cc: "H. Peter Anvin" <h.peter.anvin@intel.com>
Link: http://lkml.kernel.org/r/1480625714-54246-1-git-send-email-fenghua.yu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
This patch provides APEI arch-specific bits for ARM64
Meanwhile,
(1) Move HEST type (ACPI_HEST_TYPE_IA32_CORRECTED_CHECK) checking to
a generic place.
(2) Select HAVE_ACPI_APEI when EFI and ACPI is set on ARM64, because
arch_apei_get_mem_attribute is using efi_mem_attributes() on
ARM64.
Signed-off-by: Tomasz Nowicki <tomasz.nowicki@linaro.org>
Tested-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Fu Wei <fu.wei@linaro.org>
[ Fu Wei: improve && upstream ]
Acked-by: Hanjun Guo <hanjun.guo@linaro.org>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
0-day testing encountered a NULL pointer dereference in a cpumask access
from tsc_store_and_check_tsc_adjust().
This happens when the function is called on the boot CPU and the topology
masks are not yet available due to CPUMASK_OFFSTACK=y.
Add a NULL pointer check for the mask pointer. If NULL it's safe to assume
that the CPU is the boot CPU and the first one in the package.
Fixes: 8b223bc7abe0 ("x86/tsc: Store and check TSC ADJUST MSR")
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
This is done to simplify the kexec_add_buffer argument list.
Adapt all callers to set up a kexec_buf to pass to kexec_add_buffer.
In addition, change the type of kexec_buf.buffer from char * to void *.
There is no particular reason for it to be a char *, and the change
allows us to get rid of 3 existing casts to char * in the code.
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Add the missing return statement to the inline stub
tsc_store_and_check_tsc_adjust() and add the other stubs to make a
SMP=y,TSC=n build happy.
While at it, remove the unused variable from the UP variant of
tsc_store_and_check_tsc_adjust().
Fixes: commit ba75fb646931 ("x86/tsc: Sync test only for the first cpu in a package")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Right now CONFIG_SCHED_MC_PRIO has X86_INTEL_PSTATE as a dependency,
which is not enabled by default and which hides the CONFIG_SCHED_MC_PRIO
hardware-enabling feature.
Select X86_INTEL_PSTATE instead, plus its dependency (CPU_FREQ), if the
user enables CONFIG_SCHED_MC_PRIO=y.
(Also align the CONFIG_SCHED_MC_PRIO Kconfig help text in standard style.)
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: bp@suse.de
Cc: jolsa@redhat.com
Cc: linux-acpi@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: rjw@rjwysocki.net
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Rename CONFIG_SCHED_ITMT for Intel Turbo Boost Max Technology 3.0
to CONFIG_SCHED_MC_PRIO. This makes the configuration extensible
in future to other architectures that wish to similarly establish
CPU core priorities support in the scheduler.
The description in Kconfig is updated to reflect this change with
added details for better clarity. The configuration is explicitly
default-y, to enable the feature on CPUs that have this feature.
It has no effect on non-TBM3 CPUs.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Cc: jolsa@redhat.com
Cc: linux-acpi@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/2b2ee29d93e3f162922d72d0165a1405864fbb23.1480444902.git.tim.c.chen@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|