summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2016-11-13x86/efi: Fix EFI memmap pointer size warningBorislav Petkov
Fix this when building on 32-bit: arch/x86/platform/efi/efi.c: In function ‘__efi_enter_virtual_mode’: arch/x86/platform/efi/efi.c:911:5: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] (efi_memory_desc_t *)pa); ^ arch/x86/platform/efi/efi.c:918:5: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] (efi_memory_desc_t *)pa); ^ The @pa local variable is declared as phys_addr_t and that is a u64 when CONFIG_PHYS_ADDR_T_64BIT=y. (The last is enabled on 32-bit on a PAE build.) However, its value comes from __pa() which is basically doing pointer arithmetic and checking, and returns unsigned long as it is the native pointer width. So let's use an unsigned long too. It should be fine to do so because the later users cast it to a pointer too. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20161112210424.5157-2-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-13x86/efi: Retrieve and assign Apple device propertiesLukas Wunner
Apple's EFI drivers supply device properties which are needed to support Macs optimally. They contain vital information which cannot be obtained any other way (e.g. Thunderbolt Device ROM). They're also used to convey the current device state so that OS drivers can pick up where EFI drivers left (e.g. GPU mode setting). There's an EFI driver dubbed "AAPL,PathProperties" which implements a per-device key/value store. Other EFI drivers populate it using a custom protocol. The macOS bootloader /System/Library/CoreServices/boot.efi retrieves the properties with the same protocol. The kernel extension AppleACPIPlatform.kext subsequently merges them into the I/O Kit registry (see ioreg(8)) where they can be queried by other kernel extensions and user space. This commit extends the efistub to retrieve the device properties before ExitBootServices is called. It assigns them to devices in an fs_initcall so that they can be queried with the API in <linux/property.h>. Note that the device properties will only be available if the kernel is booted with the efistub. Distros should adjust their installers to always use the efistub on Macs. grub with the "linux" directive will not work unless the functionality of this commit is duplicated in grub. (The "linuxefi" directive should work but is not included upstream as of this writing.) The custom protocol has GUID 91BD12FE-F6C3-44FB-A5B7-5122AB303AE0 and looks like this: typedef struct { unsigned long version; /* 0x10000 */ efi_status_t (*get) ( IN struct apple_properties_protocol *this, IN struct efi_dev_path *device, IN efi_char16_t *property_name, OUT void *buffer, IN OUT u32 *buffer_len); /* EFI_SUCCESS, EFI_NOT_FOUND, EFI_BUFFER_TOO_SMALL */ efi_status_t (*set) ( IN struct apple_properties_protocol *this, IN struct efi_dev_path *device, IN efi_char16_t *property_name, IN void *property_value, IN u32 property_value_len); /* allocates copies of property name and value */ /* EFI_SUCCESS, EFI_OUT_OF_RESOURCES */ efi_status_t (*del) ( IN struct apple_properties_protocol *this, IN struct efi_dev_path *device, IN efi_char16_t *property_name); /* EFI_SUCCESS, EFI_NOT_FOUND */ efi_status_t (*get_all) ( IN struct apple_properties_protocol *this, OUT void *buffer, IN OUT u32 *buffer_len); /* EFI_SUCCESS, EFI_BUFFER_TOO_SMALL */ } apple_properties_protocol; Thanks to Pedro Vilaça for this blog post which was helpful in reverse engineering Apple's EFI drivers and bootloader: https://reverse.put.as/2016/06/25/apple-efi-firmware-passwords-and-the-scbo-myth/ If someone at Apple is reading this, please note there's a memory leak in your implementation of the del() function as the property struct is freed but the name and value allocations are not. Neither the macOS bootloader nor Apple's EFI drivers check the protocol version, but we do to avoid breakage if it's ever changed. It's been the same since at least OS X 10.6 (2009). The get_all() function conveniently fills a buffer with all properties in marshalled form which can be passed to the kernel as a setup_data payload. The number of device properties is dynamic and can change between a first invocation of get_all() (to determine the buffer size) and a second invocation (to retrieve the actual buffer), hence the peculiar loop which does not finish until the buffer size settles. The macOS bootloader does the same. The setup_data payload is later on unmarshalled in an fs_initcall. The idea is that most buses instantiate devices in "subsys" initcall level and drivers are usually bound to these devices in "device" initcall level, so we assign the properties in-between, i.e. in "fs" initcall level. This assumes that devices to which properties pertain are instantiated from a "subsys" initcall or earlier. That should always be the case since on macOS, AppleACPIPlatformExpert::matchEFIDevicePath() only supports ACPI and PCI nodes and we've fully scanned those buses during "subsys" initcall level. The second assumption is that properties are only needed from a "device" initcall or later. Seems reasonable to me, but should this ever not work out, an alternative approach would be to store the property sets e.g. in a btree early during boot. Then whenever device_add() is called, an EFI Device Path would have to be constructed for the newly added device, and looked up in the btree. That way, the property set could be assigned to the device immediately on instantiation. And this would also work for devices instantiated in a deferred fashion. It seems like this approach would be more complicated and require more code. That doesn't seem justified without a specific use case. For comparison, the strategy on macOS is to assign properties to objects in the ACPI namespace (AppleACPIPlatformExpert::mergeEFIProperties()). That approach is definitely wrong as it fails for devices not present in the namespace: The NHI EFI driver supplies properties for attached Thunderbolt devices, yet on Macs with Thunderbolt 1 only one device level behind the host controller is described in the namespace. Consequently macOS cannot assign properties for chained devices. With Thunderbolt 2 they started to describe three device levels behind host controllers in the namespace but this grossly inflates the SSDT and still fails if the user daisy-chained more than three devices. We copy the property names and values from the setup_data payload to swappable virtual memory and afterwards make the payload available to the page allocator. This is just for the sake of good housekeeping, it wouldn't occupy a meaningful amount of physical memory (4444 bytes on my machine). Only the payload is freed, not the setup_data header since otherwise we'd break the list linkage and we cannot safely update the predecessor's ->next link because there's no locking for the list. The payload is currently not passed on to kexec'ed kernels, same for PCI ROMs retrieved by setup_efi_pci(). This can be added later if there is demand by amending setup_efi_state(). The payload can then no longer be made available to the page allocator of course. Tested-by: Lukas Wunner <lukas@wunner.de> [MacBookPro9,1] Tested-by: Pierre Moreau <pierre.morrow@free.fr> [MacBookPro11,3] Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Andreas Noever <andreas.noever@gmail.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pedro Vilaça <reverser@put.as> Cc: Peter Jones <pjones@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: grub-devel@gnu.org Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20161112213237.8804-9-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-13efi: Allow bitness-agnostic protocol callsLukas Wunner
We already have a macro to invoke boot services which on x86 adapts automatically to the bitness of the EFI firmware: efi_call_early(). The macro allows sharing of functions across arches and bitness variants as long as those functions only call boot services. However in practice functions in the EFI stub contain a mix of boot services calls and protocol calls. Add an efi_call_proto() macro for bitness-agnostic protocol calls to allow sharing more code across arches as well as deduplicating 32 bit and 64 bit code paths. On x86, implement it using a new efi_table_attr() macro for bitness- agnostic table lookups. Refactor efi_call_early() to make use of the same macro. (The resulting object code remains identical.) Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Andreas Noever <andreas.noever@gmail.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Jones <pjones@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20161112213237.8804-8-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-11crypto: aesni: shut up -Wmaybe-uninitialized warningArnd Bergmann
The rfc4106 encrypy/decrypt helper functions cause an annoying false-positive warning in allmodconfig if we turn on -Wmaybe-uninitialized warnings again: arch/x86/crypto/aesni-intel_glue.c: In function ‘helper_rfc4106_decrypt’: include/linux/scatterlist.h:67:31: warning: ‘dst_sg_walk.sg’ may be used uninitialized in this function [-Wmaybe-uninitialized] The problem seems to be that the compiler doesn't track the state of the 'one_entry_in_sg' variable across the kernel_fpu_begin/kernel_fpu_end section. This takes the easy way out by adding a bogus initialization, which should be harmless enough to get the patch into v4.9 so we can turn on this warning again by default without producing useless output. A follow-up patch for v4.10 rearranges the code to make the warning go away. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11x86: apm: avoid uninitialized dataArnd Bergmann
apm_bios_call() can fail, and return a status in its argument structure. If that status however is zero during a call from apm_get_power_status(), we end up using data that may have never been set, as reported by "gcc -Wmaybe-uninitialized": arch/x86/kernel/apm_32.c: In function ‘apm’: arch/x86/kernel/apm_32.c:1729:17: error: ‘bx’ may be used uninitialized in this function [-Werror=maybe-uninitialized] arch/x86/kernel/apm_32.c:1835:5: error: ‘cx’ may be used uninitialized in this function [-Werror=maybe-uninitialized] arch/x86/kernel/apm_32.c:1730:17: note: ‘cx’ was declared here arch/x86/kernel/apm_32.c:1842:27: error: ‘dx’ may be used uninitialized in this function [-Werror=maybe-uninitialized] arch/x86/kernel/apm_32.c:1731:17: note: ‘dx’ was declared here This changes the function to return "APM_NO_ERROR" here, which makes the code more robust to broken BIOS versions, and avoids the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Jiri Kosina <jkosina@suse.cz> Reviewed-by: Luis R. Rodriguez <mcgrof@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11perf/x86/intel/uncore: Add more Intel uncore IMC PCI IDs for SkyLakeKan Liang
Several uncore IMC PCI IDs are missed for Intel SkyLake. Add the PCI IDs for SkyLake Y, U, H and S platforms. Rename the ID macros for 0x191f and 0x190c. The corresponding bug: https://bugzilla.kernel.org/show_bug.cgi?id=187301 The related datasheets are also attached in the bug entry for permanent reference. Reported-by: Ben Widawsky <benjamin.widawsky@intel.com> Tested-by: Ben Widawsky <benjamin.widawsky@intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Reviewed-by: Ben Widawsky <benjamin.widawsky@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1478631281-5061-1-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-11Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-11x86/mce/AMD: Fix HWID_MCATYPE calculation by grouping argumentsYazen Ghannam
The calculation of the hwid_mcatype value in get_smca_bank_info() became incorrect after applying the following commit: 1ce9cd7f9f0b ("x86/RAS: Simplify SMCA HWID descriptor struct") This causes the function to not match a bank to its type. Disassembly of hwid_mcatype calculation after change: db: 8b 45 e0 mov -0x20(%rbp),%eax de: 41 89 c4 mov %eax,%r12d e1: 25 00 00 ff 0f and $0xfff0000,%eax e6: 41 c1 ec 10 shr $0x10,%r12d ea: 41 09 c4 or %eax,%r12d Disassembly of hwid_mcatype calculation in original code: 286: 8b 45 d0 mov -0x30(%rbp),%eax 289: 41 89 c5 mov %eax,%r13d 28c: c1 e8 10 shr $0x10,%eax 28f: 41 81 e5 ff 0f 00 00 and $0xfff,%r13d 296: 41 c1 e5 10 shl $0x10,%r13d 29a: 41 09 c5 or %eax,%r13d Grouping the arguments to the HWID_MCATYPE() macro fixes the issue. ( Boris suggested adding parentheses in the macro. ) Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-11x86/MCE: Correct TSC timestamping of error recordsBorislav Petkov
We did have logic in the MCE code which would TSC-timestamp an error record only when it is exact - i.e., when it wasn't detected by polling. This isn't the case anymore. So let's fix that: We have a valid TSC timestamp in the error record only when it has been a precise detection, i.e., either in the #MC handler or in one of the interrupt handlers (thresholding, deferred, ...). All other error records still have mce.time which contains the wall time in order to be able to place the error record in time at least approximately. Also, this fixes another bug where machine_check_poll() would clear mce.tsc unconditionally even if we requested precise MCP_TIMESTAMP logging. The proper fix would be to generate timestamp only when it has been requested and not always. But that would require a more thorough code audit of all mce_gather_info/mce_setup() users. Add a FIXME for now. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony <tony.luck@intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: kernel test robot <xiaolong.ye@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: lkp@01.org Link: http://lkml.kernel.org/r/20161110131053.kybsijfs5venpjnf@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-10drivers: base: cacheinfo: fix x86 with CONFIG_OF enabledSudeep Holla
With CONFIG_OF enabled on x86, we get the following error on boot: " Failed to find cpu0 device node Unable to detect cache hierarchy from DT for CPU 0 " and the cacheinfo fails to get populated in the corresponding sysfs entries. This is because cache_setup_of_node looks for of_node for setting up the shared cpu_map without checking that it's already populated in the architecture specific callback. In order to indicate that the shared cpu_map is already populated, this patch introduces a boolean `cpu_map_populated` in struct cpu_cacheinfo that can be used by the generic code to skip cache_shared_cpu_map_setup. This patch also sets that boolean for x86. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-09x86/kexec: add -fno-PIESebastian Andrzej Siewior
If the gcc is configured to do -fPIE by default then the build aborts later with: | Unsupported relocation type: unknown type rel type name (29) Tagging it stable so it is possible to compile recent stable kernels as well. Cc: stable@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Michal Marek <mmarek@suse.com>
2016-11-09x86/apic: Prevent tracing on apic_msr_write_eoi()Wanpeng Li
The following RCU lockdep warning led to adding irq_enter()/irq_exit() into smp_reschedule_interrupt(): RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 0 RCU used illegally from extended quiescent state! no locks held by swapper/1/0. do_trace_write_msr native_write_msr native_apic_msr_eoi_write smp_reschedule_interrupt reschedule_interrupt As Peterz pointed out: | So now we're making a very frequent interrupt slower because of debug | code. | | The thing is, many many smp_reschedule_interrupt() invocations don't | actually execute anything much at all and are only sent to tickle the | return to user path (which does the actual preemption). | | Having to do the whole irq_enter/irq_exit dance just for this unlikely | debug case totally blows. Use the wrmsr_notrace() variant in native_apic_msr_write_eoi, annotate the kvm variant with notrace and add a native_apic_eoi callback to the apic structure so KVM guests are covered as well. This allows to revert the irq_enter/irq_exit dance in smp_reschedule_interrupt(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Cc: Mike Galbraith <efault@gmx.de> Cc: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/1478488420-5982-3-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09x86/msr: Add wrmsr_notrace()Wanpeng Li
Required to remove the extra irq_enter()/irq_exit() in smp_reschedule_interrupt(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Borislav Petkov <bp@alien8.de> Cc: kvm@vger.kernel.org Cc: Mike Galbraith <efault@gmx.de> Link: http://lkml.kernel.org/r/1478488420-5982-2-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09x86/pat, mm: Make track_pfn_insert() return voidBorislav Petkov
It only returns 0 so we can save us the testing of its retval everywhere. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: mcgrof@suse.com Cc: dri-devel@lists.freedesktop.org Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Airlie <airlied@redhat.com> Cc: dan.j.williams@intel.com Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/20161026174839.rusfxkm3xt4ennhe@pd.tnic Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09x86/cpu: Deal with broken firmware (VMWare/XEN)Thomas Gleixner
Both ACPI and MP specifications require that the APIC id in the respective tables must be the same as the APIC id in CPUID. The kernel retrieves the physical package id from the APIC id during the ACPI/MP table scan and builds the physical to logical package map. The physical package id which is used after a CPU comes up is retrieved from CPUID. So we rely on ACPI/MP tables and CPUID agreeing in that respect. There exist VMware and XEN implementations which violate the spec. As a result the physical to logical package map, which relies on the ACPI/MP tables does not work on those systems, because the CPUID initialized physical package id does not match the firmware id. This causes system crashes and malfunction due to invalid package mappings. The only way to cure this is to sanitize the physical package id after the CPUID enumeration and yell when the APIC ids are different. Fix up the initial APIC id, which is fine as it is only used printout purposes. If the physical package IDs differ yell and use the package information from the ACPI/MP tables so the existing logical package map just works. Chas provided the resulting dmesg output for his affected 4 virtual sockets, 1 core per socket VM: [Firmware Bug]: CPU1: APIC id mismatch. Firmware: 1 CPUID: 2 [Firmware Bug]: CPU1: Using firmware package id 1 instead of 2 .... Reported-and-tested-by: "Charles (Chas) Williams" <ciwillia@brocade.com>, Reported-by: M. Vefa Bicakci <m.v.b@runbox.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Alok Kataria <akataria@vmware.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: #4.6+ <stable@vger,kernel.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611091613540.3501@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09x86/cpu/AMD: Clean up cpu_llc_id assignment per topology featureYazen Ghannam
These changes do not affect current hw - just a cleanup: Currently, we assume that a system has a single Last Level Cache (LLC) per node, and that the cpu_llc_id is thus equal to the node_id. This no longer applies since Fam17h can have multiple last level caches within a node. So group the cpu_llc_id assignment by topology feature and family in order to make the computation of cpu_llc_id on the different families more clear. Here is how the LLC ID is being computed on the different families: The NODEID_MSR feature only applies to Fam10h in which case the LLC is at the node level. The TOPOEXT feature is used on families 15h, 16h and 17h. So far we only see multiple last level caches if L3 caches are available. Otherwise, the cpu_llc_id will default to be the phys_proc_id. We have L3 caches only on families 15h and 17h: - on Fam15h, the LLC is at the node level. - on Fam17h, the LLC is at the core complex level and can be found by right shifting the APIC ID. Also, keep the family checks explicit so that new families will fall back to the default, which will be node_id for TOPOEXT systems. Single node systems in families 10h and 15h will have a Node ID of 0 which will be the same as the phys_proc_id, so we don't need to check for multiple nodes before using the node_id. Tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> [ Rewrote the commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20161108153054.bs3sajbyevq6a6uu@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-09Merge branch 'x86/urgent' into x86/cpu, to pick up dependent fixIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-09x86/cpu/AMD: Fix cpu_llc_id for AMD Fam17h systemsYazen Ghannam
cpu_llc_id (Last Level Cache ID) derivation on AMD Fam17h has an underflow bug when extracting the socket_id value. It starts from 0 so subtracting 1 from it will result in an invalid value. This breaks scheduling topology later on since the cpu_llc_id will be incorrect. For example, the the cpu_llc_id of the *other* CPU in the loops in set_cpu_sibling_map() underflows and we're generating the funniest thread_siblings masks and then when I run 8 threads of nbench, they get spread around the LLC domains in a very strange pattern which doesn't give you the normal scheduling spread one would expect for performance. Other things like EDAC use cpu_llc_id so they will be b0rked too. So, the APIC ID is preset in APICx020 for bits 3 and above: they contain the core complex, node and socket IDs. The LLC is at the core complex level so we can find a unique cpu_llc_id by right shifting the APICID by 3 because then the least significant bit will be the Core Complex ID. Tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> [ Cleaned up and extended the commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> # v4.4.. Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: 3849e91f571d ("x86/AMD: Fix last level cache topology for AMD Fam17h systems") Link: http://lkml.kernel.org/r/20161108083506.rvqb5h4chrcptj7d@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-09Merge tag 'tags/for-kvmgt' into HEADPaolo Bonzini
The three KVM patches that KVMGT needs. Conflicts: arch/x86/include/asm/kvm_page_track.h arch/x86/kvm/mmu.c
2016-11-08x86/RAS: Hide SMCA bank namesBorislav Petkov
Add accessor functions and hide the smca_names array. Also, add a sanity-check to bank HWID assignment in get_smca_bank_info(). Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/20161104152317.5r276t35df53qk76@pd.tnic Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-08x86/RAS: Rename smca_bank_names to smca_namesBorislav Petkov
Make it differ more from struct smca_bank_name for better readability. Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: http://lkml.kernel.org/r/20161103125556.15482-3-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-08x86/RAS: Simplify SMCA HWID descriptor structBorislav Petkov
Call it simply smca_hwid and call local variables "hwid". More readable. Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: http://lkml.kernel.org/r/20161103125556.15482-2-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-08x86/RAS: Simplify SMCA bank descriptor structBorislav Petkov
Call the struct simply smca_bank, it's instance ID can be simply ->id. Makes the code much more readable. Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: http://lkml.kernel.org/r/20161103125556.15482-1-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-08x86/MCE: Dump MCE to dmesg if no consumersBorislav Petkov
When there are no error record consumers registered with the kernel, the only thing that appears in dmesg is something like: [ 300.000326] mce: [Hardware Error]: Machine check events logged and the error records are gone. Which is seriously counterproductive. So let's dump them to dmesg instead, in such a case. Requested-by: Eric Morton <Eric.Morton@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/20161101120911.13163-4-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-08x86/RAS: Add TSC timestamp to the injected MCEBorislav Petkov
The MCE injection code does not provide the time stamp information for the injected MCE. Add it. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/20161101120911.13163-3-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-08x86/MCE: Do not look at panic_on_oops in the severity gradingYinghai Lu
The MCE tolerance levels control whether we panic on a machine check or do something else like generating a signal and logging error information. This is controlled by the mce=<level> command line parameter. However, if panic_on_oops is set, it will force a panic for such an MCE even though the user didn't want to. So don't check panic_on_oops in the severity grading anymore. One of the use cases for that is recovery from uncorrectable errors with mce=2. [ Boris: rewrite commit message. ] Signed-off-by: Yinghai Lu <yinghai.lu@oracle.com> Acked-by: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/20160916202325.4972-1-yinghai@kernel.org Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-07swiotlb-xen: Enforce return of DMA_ERROR_CODE in mapping functionAlexander Duyck
The mapping function should always return DMA_ERROR_CODE when a mapping has failed as this is what the DMA API expects when a DMA error has occurred. The current function for mapping a page in Xen was returning either DMA_ERROR_CODE or 0 depending on where it failed. On x86 DMA_ERROR_CODE is 0, but on other architectures such as ARM it is ~0. We need to make sure we return the same error value if either the mapping failed or the device is not capable of accessing the mapping. If we are returning DMA_ERROR_CODE as our error value we can drop the function for checking the error code as the default is to compare the return value against DMA_ERROR_CODE if no function is defined. Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad@kernel.org>
2016-11-07x86/platform/intel-mid: Retrofit pci_platform_pm_ops ->get_state hookLukas Wunner
Commit cc7cc02bada8 ("PCI: Query platform firmware for device power state") augmented struct pci_platform_pm_ops with a ->get_state hook and implemented it for acpi_pci_platform_pm, the only pci_platform_pm_ops existing till v4.7. However v4.8 introduced another pci_platform_pm_ops for Intel Mobile Internet Devices with commit 5823d0893ec2 ("x86/platform/intel-mid: Add Power Management Unit driver"). It is missing the ->get_state hook, which is fatal since pci_set_platform_pm() enforces its presence. Andy Shevchenko reports that without the present commit, such a device "crashes without even a character printed out on serial console and reboots (since watchdog)". Retrofit mid_pci_platform_pm with the missing callback to fix the breakage. Acked-and-tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Fixes: cc7cc02bada8 ("PCI: Query platform firmware for device power state") Signed-off-by: Lukas Wunner <lukas@wunner.de> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-pci@vger.kernel.org Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: http://lkml.kernel.org/r/7c1567d4c49303a4aada94ba16275cbf56b8976b.1477221514.git.lukas@wunner.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-07x86/intel_rdt: Export the minimum number of set mask bits in sysfsShaohua Li
The minimum number of bits set for a cache mask is checked by the kernel when writing a mask, but there is no way for the user to retrieve this information. Add a new file 'min_cbm_bits' to the info directory and export the information to user space. [ tglx: Massaged changelog ] Signed-off-by: Shaohua Li <shli@fb.com> Cc: fenghua.yu@intel.com Cc: tony.luck@intel.com Link: http://lkml.kernel.org/r/e69b1ffa206d0353eea58101e1bf9b677d9732f7.1478207143.git.shli@fb.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-07x86/intel_rdt: Propagate error in rdt_mount() properlyShaohua Li
gcc complains: "warning: ‘dentry’ may be used uninitialized in this function" The error exit path in rdt_mount(), which deals with a failure in rdtgroup_create_info_dir(), does not set the error code in dentry and returns the uninitialized dentry value. Add the missing error propagation. [tglx: Massaged changelog ] Fixes: 4e978d06dedb ("x86/intel_rdt: Add "info" files to resctrl file system") Signed-off-by: Shaohua Li <shli@fb.com> Cc: fenghua.yu@intel.com Cc: tony.luck@intel.com Link: http://lkml.kernel.org/r/a56a556f6768dc12cadbf97f49e000189056f90e.1478207143.git.shli@fb.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-07x86/intel_rdt: Add a missing #includeBorislav Petkov
... to fix this build warning: In file included from arch/x86/kernel/cpu/intel_rdt.c:33:0: ./arch/x86/include/asm/intel_rdt.h:56:25: warning: ‘struct kernfs_open_file’ declared inside parameter list will not be visible outside of this definition or declaration int (*seq_show)(struct kernfs_open_file *of, ^~~~~~~~~~~~~~~~ ... Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <h.peter.anvin@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nilay Vaish <nilayvaish@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi V Shankar <ravi.v.shankar@intel.com> Cc: Sai Prakhya <sai.praneeth.prakhya@intel.com> Cc: Shaohua Li <shli@fb.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Link: http://lkml.kernel.org/r/20161102165117.23545-1-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-07x86/boot: Simplify the GDTR calculation assembly code a bitWei Yang
This patch calculates the GDTR's base address via a single instruction. ( EBP contains the address where it is loaded and GDTR's base address is already set to "gdt" in compilation. It is fine to get the correct base address by adding the delta to GDTR's base address. ) Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1478015364-5547-1-git-send-email-richard.weiyang@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-07x86/boot/build: Remove always empty $(USERINCLUDE)Paul Bolle
Commmit b6eea87fc685: ("x86, boot: Explicitly include autoconf.h for hostprogs") correctly noted: [...] that because $(USERINCLUDE) isn't exported by the top-level Makefile it's actually empty in arch/x86/boot/Makefile. So let's do the sane thing and remove the reference to that make variable. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: David Howells <dhowells@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1478166468-9760-1-git-send-email-pebolle@tiscali.nl Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-04Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "One NULL pointer dereference, and two fixes for regressions introduced during the merge window. The rest are fixes for MIPS, s390 and nested VMX" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvm: x86: Check memopp before dereference (CVE-2016-8630) kvm: nVMX: VMCLEAR an active shadow VMCS after last use KVM: x86: drop TSC offsetting kvm_x86_ops to fix KVM_GET/SET_CLOCK KVM: x86: fix wbinvd_dirty_mask use-after-free kvm/x86: Show WRMSR data is in hex kvm: nVMX: Fix kernel panics induced by illegal INVEPT/INVVPID types KVM: document lock orders KVM: fix OOPS on flush_work KVM: s390: Fix STHYI buffer alignment for diag224 KVM: MIPS: Precalculate MMIO load resume PC KVM: MIPS: Make ERET handle ERL before EXL KVM: MIPS: Fix lazy user ASID regenerate for SMP
2016-11-04kvm/page_track: export symbols for external usageJike Song
Signed-off-by: Jike Song <jike.song@intel.com> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-04kvm/page_track: call notifiers with kvm_page_track_notifier_nodeJike Song
The user of page_track might needs extra information, so pass the kvm_page_track_notifier_node to callbacks. Signed-off-by: Jike Song <jike.song@intel.com> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-04KVM: x86: add track_flush_slot page track notifierXiaoguang Chen
When a memory slot is being moved or removed users of page track can be notified. So users can drop write-protection for the pages in that memory slot. This notifier type is needed by KVMGT to sync up its shadow page table when memory slot is being moved or removed. Register the notifier type track_flush_slot to receive memslot move and remove event. Reviewed-by: Xiao Guangrong <guangrong.xiao@intel.com> Signed-off-by: Chen Xiaoguang <xiaoguang.chen@intel.com> [Squashed commits to avoid bisection breakage and reworded the subject.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-03kvm: x86: avoid atomic operations on APICv vmentryPaolo Bonzini
On some benchmarks (e.g. netperf with ioeventfd disabled), APICv posted interrupts turn out to be slower than interrupt injection via KVM_REQ_EVENT. This patch optimizes a bit the IRR update, avoiding expensive atomic operations in the common case where PI.ON=0 at vmentry or the PIR vector is mostly zero. This saves at least 20 cycles (1%) per vmexit, as measured by kvm-unit-tests' inl_from_qemu test (20 runs): | enable_apicv=1 | enable_apicv=0 | mean stdev | mean stdev ----------|-----------------|------------------ before | 5826 32.65 | 5765 47.09 after | 5809 43.42 | 5777 77.02 Of course, any change in the right column is just placebo effect. :) The savings are bigger if interrupts are frequent. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02KVM: nVMX: support descriptor table exitsPaolo Bonzini
These are never used by the host, but they can still be reflected to the guest. Tested-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02KVM: x86: use ktime_get instead of seeking the hrtimer_clock_basePaolo Bonzini
The base clock for the LAPIC timer is always CLOCK_MONOTONIC. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer supportWanpeng Li
Most windows guests still utilize APIC Timer periodic/oneshot mode instead of tsc-deadline mode, and the APIC Timer periodic/oneshot mode are still emulated by high overhead hrtimer on host. This patch converts the expected expire time of the periodic/oneshot mode to guest deadline tsc in order to leverage VMX preemption timer logic for APIC Timer tsc-deadline mode. After each preemption timer vmexit preemption timer is restarted to emulate LVTT current-count register is automatically reloaded from the initial-count register when the count reaches 0. This patch reduces ~5600 cycles for each APIC Timer periodic mode operation virtualization. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> [Squashed with my fixes that were reviewed-by Paolo.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02KVM: LAPIC: rename start/cancel_hv_tscdeadline to start/cancel_hv_timerWanpeng Li
Rename start/cancel_hv_tscdeadline to start/cancel_hv_timer since they will handle both APIC Timer periodic/oneshot mode and tsc-deadline mode. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02KVM: LAPIC: introduce kvm_get_lapic_target_expiration_tsc()Wanpeng Li
Introdce kvm_get_lapic_target_expiration_tsc() to get APIC Timer target deadline tsc. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02KVM: LAPIC: guarantee the timer is in tsc-deadline modeWanpeng Li
Check apic_lvtt_tscdeadline() mode directly instead of apic_lvtt_oneshot() and apic_lvtt_period() to guarantee the timer is in tsc-deadline mode when rdmsr MSR_IA32_TSCDEADLINE. Suggested-by: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02KVM: LAPIC: extract start_sw_period() to handle periodic/oneshot modeWanpeng Li
Extract start_sw_period() to handle periodic/oneshot mode, it will be used by later patch. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02kvm: x86: remove the misleading comment in vmx_handle_external_intrLongpeng(Mike)
Since Paolo has removed irq-enable-operation in vmx_handle_external_intr (KVM: x86: use guest_exit_irqoff), the original comment about the IF bit in rflags is incorrect and stale now, so remove it. Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02KVM: x86: add track_flush_slot page track notifierXiaoguang Chen
When a memory slot is being moved or removed users of page track can be notified. So users can drop write-protection for the pages in that memory slot. This notifier type is needed by KVMGT to sync up its shadow page table when memory slot is being moved or removed. Register the notifier type track_flush_slot to receive memslot move and remove event. Reviewed-by: Xiao Guangrong <guangrong.xiao@intel.com> Signed-off-by: Chen Xiaoguang <xiaoguang.chen@intel.com> [Squashed commits to avoid bisection breakage and reworded the subject.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02KVM: VMX: refactor setup of global page-sized bitmapsRadim Krčmář
We've had 10 page-sized bitmaps that were being allocated and freed one by one when we could just use a cycle. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02KVM: VMX: join functions that disable x2apic msr interceptsRadim Krčmář
vmx_disable_intercept_msr_read_x2apic() and vmx_disable_intercept_msr_write_x2apic() differed only in the type. Pass the type to a new function. [Ordered and commented TPR intercept according to Paolo's suggestion.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02KVM: VMX: remove functions that enable msr interceptsRadim Krčmář
All intercepts are enabled at the beginning, so they can only be used if we disabled an intercept that we wanted to have enabled. This was done for TMCCT to simplify a loop that disables all x2APIC MSR intercepts, but just keeping TMCCT enabled yields better results. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>