path: root/arch/x86
Age         Commit message                                                          Author
2024-08-29  KVM: x86/mmu: Drop pointless "return" wrapper label in FNAME(fetch)  (Sean Christopherson)
Drop the pointless and poorly named "out_gpte_changed" label in FNAME(fetch) and instead return RET_PF_RETRY directly. No functional change intended. Link: https://lore.kernel.org/r/20240802203900.348808-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
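A rough sketch of the shape of the change (condition and variable names are illustrative, not the exact diff):

    if (FNAME(gpte_changed)(vcpu, gw, top_level))
            return RET_PF_RETRY;    /* was: goto out_gpte_changed; */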
2024-08-29  KVM: x86/mmu: Decrease indentation in logic to sync new indirect shadow page  (Sean Christopherson)
Combine the back-to-back if-statements for synchronizing children when linking a new indirect shadow page in order to decrease the indentation, and to make it easier to "see" the logic in its entirety. No functional change intended. Link: https://lore.kernel.org/r/20240802203900.348808-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29  KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC)  (Sean Christopherson)
Re-introduce the "split" x2APIC ICR storage that KVM used prior to Intel's IPI virtualization support, but only for AMD. Although this is not stated anywhere in the APM, which describes the ICR as a single 64-bit register, AMD CPUs store the 64-bit ICR as two separate 32-bit values in ICR and ICR2. When IPI virtualization (IPIv on Intel, all AVIC flavors on AMD) is enabled, KVM needs to match CPU behavior as some ICR writes will be handled by the CPU, not by KVM. Add a kvm_x86_ops knob to control the underlying format used by the CPU to store the x2APIC ICR, and tune it to AMD vs. Intel regardless of whether or not x2AVIC is enabled. If KVM is handling all ICR writes, the storage format for x2APIC mode doesn't matter, and having the behavior follow AMD versus Intel will provide better test coverage and ease debugging. Fixes: 4d1d7942e36a ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Link: https://lore.kernel.org/r/20240719235107.3023592-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29  KVM: x86: Move x2APIC ICR helper above kvm_apic_write_nodecode()  (Sean Christopherson)
Hoist kvm_x2apic_icr_write() above kvm_apic_write_nodecode() so that a local helper to _read_ the x2APIC ICR can be added and used in the nodecode path without needing a forward declaration. No functional change intended. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240719235107.3023592-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29  KVM: x86: Enforce x2APIC's must-be-zero reserved ICR bits  (Sean Christopherson)
Inject a #GP on a WRMSR(ICR) that attempts to set any reserved bits that are must-be-zero on both Intel and AMD, i.e. any reserved bits other than the BUSY bit, which Intel ignores and basically says is undefined. KVM's xapic_state_test selftest has been fudging the bug since commit 4b88b1a518b3 ("KVM: selftests: Enhance handling WRMSR ICR register in x2APIC mode"), which essentially removed the testcase instead of fixing the bug. WARN if the nodecode path triggers a #GP, as the CPU is supposed to check reserved bits for ICR when it's partially virtualized. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240719235107.3023592-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
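A sketch of the check, assuming the reserved-bit mask below (every architecturally reserved ICR bit except BUSY, bit 12):

    /* Bits 13, 16-17 and 20-31 are reserved in the x2APIC ICR. */
    #define X2APIC_ICR_RESERVED_BITS (GENMASK_ULL(31, 20) | GENMASK_ULL(17, 16) | BIT(13))

    if (data & X2APIC_ICR_RESERVED_BITS)
            return 1;       /* propagated back to the guest as a #GP */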
2024-08-29  x86/EISA: Dereference memory directly instead of using readl()  (Maciej W. Rozycki)
Sparse expects an __iomem pointer, but after converting the EISA probe to memremap() the pointer is a regular memory pointer. Access it directly instead. [ tglx: Converted it to fix the already applied version ] Fixes: 80a4da05642c ("x86/EISA: Use memremap() to probe for the EISA BIOS signature") Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/alpine.DEB.2.21.2408261015270.30766@angie.orcam.me.uk
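In practice the probe's access collapses to a plain load, roughly:

    u32 sig = *(u32 *)p;    /* p comes from memremap(); readl() is only for __iomem mappings */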
2024-08-28  KVM: SEV: Update KVM_AMD_SEV Kconfig entry and mention SEV-SNP  (Vitaly Kuznetsov)
SEV-SNP support is present since commit 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support") but the Kconfig entry wasn't updated and still mentions SEV and SEV-ES only. Add SEV-SNP there and, while at it, expand 'SEV' in the description as 'Encrypted VMs' is not what 'SEV' stands for. No functional change. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20240828122111.160273-1-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-28  x86/resctrl: Fix arch_mbm_* array overrun on SNC  (Peter Newman)
When using resctrl on systems with Sub-NUMA Clustering enabled, monitoring groups may be allocated RMID values which would overrun the arch_mbm_{local,total} arrays. This is due to inconsistencies in whether the SNC-adjusted num_rmid value or the unadjusted value in resctrl_arch_system_num_rmid_idx() is used. The num_rmid value for the L3 resource is currently: resctrl_arch_system_num_rmid_idx() / snc_nodes_per_l3_cache As a simple fix, make resctrl_arch_system_num_rmid_idx() return the SNC-adjusted, L3 num_rmid value on x86. Fixes: e13db55b5a0d ("x86/resctrl: Introduce snc_nodes_per_l3_cache") Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20240822190212.1848788-1-peternewman@google.com
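A sketch of the fix, assuming the resource lookup shown below (the actual accessors in the resctrl code may differ):

    u32 resctrl_arch_system_num_rmid_idx(void)
    {
            struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;

            /* RMID indices must not exceed the SNC-adjusted L3 num_rmid */
            return r->num_rmid;
    }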
2024-08-27  virt: sev-guest: Ensure the SNP guest messages do not exceed a page  (Nikunj A Dadhania)
Currently, struct snp_guest_msg includes a message header (96 bytes) and a payload (4000 bytes). There is an implicit assumption here that the SNP message header will always be 96 bytes, and with that assumption the payload array size has been set to 4000 bytes - a magic number. If any new member is added to the SNP message header, the SNP guest message will span more than a page. Instead of using a magic number for the payload, declare struct snp_guest_msg in a way that payload plus the message header do not exceed a page. [ bp: Massage. ] Suggested-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240731150811.156771-5-nikunj@amd.com
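The page-bounded layout could look roughly like this (a sketch; the real struct may carry additional attributes):

    struct snp_guest_msg {
            struct snp_guest_msg_hdr hdr;
            u8 payload[PAGE_SIZE - sizeof(struct snp_guest_msg_hdr)];
    } __packed;

    /* the whole message, header included, must fit in one page */
    static_assert(sizeof(struct snp_guest_msg) <= PAGE_SIZE);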
2024-08-26  x86/tdx: Fix data leak in mmio_read()  (Kirill A. Shutemov)
The mmio_read() function makes a TDVMCALL to retrieve MMIO data for an address from the VMM. Sean noticed that mmio_read() unintentionally exposes the value of an uninitialized stack variable (val) to the VMM. This variable is only needed as an output value; it did not need to be passed to the VMM in the first place. Do not send the original value of *val to the VMM. [ dhansen: clarify what 'val' is used for. ] Fixes: 31d58c4e557d ("x86/tdx: Handle in-kernel MMIO") Reported-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240826125304.1566719-1-kirill.shutemov%40linux.intel.com
2024-08-26  x86/ioremap: Improve iounmap() address range checks  (Max Ramanouski)
Allowing iounmap() on memory that was not ioremap()'d in the first place is obviously a bad idea. There is currently a feeble attempt to avoid errant iounmap()s by checking to see if the address is below "high_memory". But that's imprecise at best because there are plenty of high addresses that are also invalid to call iounmap() on. Thankfully, there is a more precise helper: is_ioremap_addr(). x86 just does not use it in iounmap(). Restrict iounmap() to addresses in the ioremap region by using is_ioremap_addr(). This aligns x86 closer to the generic iounmap() implementation. Additionally, add a warning in case there is an attempt to iounmap() invalid memory. This replaces an existing silent return and will help alert folks to any incorrect usage of iounmap(). Because VMALLOC_START is not available via asm/pgtable.h on i386, an include of asm/vmalloc.h had to be added to include/linux/ioremap.h. [ dhansen: tweak subject and changelog ] Signed-off-by: Max Ramanouski <max8rr8@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alistair Popple <apopple@nvidia.com> Link: https://lore.kernel.org/all/20240824220111.84441-1-max8rr8%40gmail.com
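The tightened check might look like this (a sketch; the actual unmap logic is omitted):

    void iounmap(volatile void __iomem *addr)
    {
            /* Warn and bail instead of silently ignoring a bogus address */
            if (WARN_ON_ONCE(!is_ioremap_addr((void __force *)addr)))
                    return;

            /* ... proceed with the real unmap of the ioremap region ... */
    }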
2024-08-25  x86/entry: Set FRED RSP0 on return to userspace instead of context switch  (Xin Li (Intel))
The FRED RSP0 MSR points to the top of the kernel stack for user level event delivery. As this is the task stack it needs to be updated when a task is scheduled in. The update is done at context switch. That means it's also done when switching to kernel threads, which is pointless as those never go out to user space. For KVM threads this means there are two writes to FRED_RSP0 as KVM has to switch to the guest value before VMENTER. Defer the update to the exit to user space path and cache the per CPU FRED_RSP0 value, so redundant writes can be avoided. Provide fred_sync_rsp0() for KVM to keep the cache in sync with the actual MSR value after returning from guest to host mode. [ tglx: Massage change log ] Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240822073906.2176342-4-xin@zytor.com
2024-08-25  x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism  (Andrew Cooper)
Per the discussion about FRED MSR writes with the WRMSRNS instruction [1], use the alternatives mechanism to choose WRMSRNS when it's available, otherwise fall back to WRMSR. Remove the dependency on X86_FEATURE_WRMSRNS as WRMSRNS is no longer dependent on FRED. [1] https://lore.kernel.org/lkml/15f56e6a-6edd-43d0-8e83-bb6430096514@citrix.com/ Use a DS prefix to pad WRMSR instead of a NOP. The prefix is ignored; at least that's the current information from the hardware folks. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240822073906.2176342-3-xin@zytor.com
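A sketch of the idea (names, constraints and exception handling simplified relative to the real helper):

    static __always_inline void wrmsrns(u32 msr, u32 low, u32 high)
    {
            /*
             * Use WRMSRNS on CPUs that have it; otherwise issue WRMSR padded
             * with a DS segment-override prefix so both encodings have the
             * same length for alternatives patching.
             */
            asm volatile(ALTERNATIVE("ds wrmsr",
                                     ".byte 0x0f, 0x01, 0xc6", /* WRMSRNS */
                                     X86_FEATURE_WRMSRNS)
                         : : "c" (msr), "a" (low), "d" (high) : "memory");
    }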
2024-08-25  x86/entry: Test ti_work for zero before processing individual bits  (Xin Li (Intel))
In most cases, the ti_work values passed to arch_exit_to_user_mode_prepare() are zero, e.g. 99% of the time in kernel build tests. So an obvious optimization is to test ti_work for zero before processing its individual bits. Omit the optimization when FPU debugging is enabled, otherwise the FPU consistency check is never executed. Intel 0day tests did not find a performance regression with this change. Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240822073906.2176342-2-xin@zytor.com
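Conceptually, the fast path looks roughly like this (a sketch; the real function may handle additional work bits):

    static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
                                                      unsigned long ti_work)
    {
            /*
             * ti_work is zero for ~99% of exits; skip the per-bit tests.
             * Don't short-circuit when FPU debugging is on, so the FPU
             * consistency check below still runs.
             */
            if (!IS_ENABLED(CONFIG_X86_DEBUG_FPU) && likely(!ti_work))
                    return;

            if (ti_work & _TIF_USER_RETURN_NOTIFY)
                    fire_user_return_notifiers();

            if (unlikely(ti_work & _TIF_IO_BITMAP))
                    tss_update_io_bitmap();

            fpregs_assert_state_consistent();
            if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
                    switch_fpu_return();
    }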
2024-08-25  perf/x86/intel: Limit the period on Haswell  (Kan Liang)
Running the ltp test cve-2015-3290 concurrently reports the following warnings:

    perfevents: irq loop stuck!
    WARNING: CPU: 31 PID: 32438 at arch/x86/events/intel/core.c:3174 intel_pmu_handle_irq+0x285/0x370
    Call Trace:
     <NMI>
     ? __warn+0xa4/0x220
     ? intel_pmu_handle_irq+0x285/0x370
     ? __report_bug+0x123/0x130
     ? intel_pmu_handle_irq+0x285/0x370
     ? __report_bug+0x123/0x130
     ? intel_pmu_handle_irq+0x285/0x370
     ? report_bug+0x3e/0xa0
     ? handle_bug+0x3c/0x70
     ? exc_invalid_op+0x18/0x50
     ? asm_exc_invalid_op+0x1a/0x20
     ? irq_work_claim+0x1e/0x40
     ? intel_pmu_handle_irq+0x285/0x370
     perf_event_nmi_handler+0x3d/0x60
     nmi_handle+0x104/0x330

Thanks to Thomas Gleixner's analysis, the issue is caused by the low initial period (1) of the frequency estimation algorithm, which triggers defects of the HW, specifically erratum HSW11 and HSW143. (For details, please refer to https://lore.kernel.org/lkml/87plq9l5d2.ffs@tglx/) HSW11 requires a period larger than 100 for the INST_RETIRED.ALL event, but the initial period in freq mode is 1. The erratum is the same as BDM11, which is already handled in the kernel, so a minimum period of 128 is enforced on HSW as well. HSW143 is about fixed counter 1 potentially overcounting by 32 when Hyper-Threading is enabled. However, based on the test, the hardware has more issues than it tells: besides fixed counter 1, the message 'interrupt took too long' can be observed on any counter which was armed with a period < 32 and had two events expire in the same NMI. A minimum period of 32 is therefore enforced for the rest of the events. The recommended workaround code for HSW143 is not implemented, because it only addresses the issue for the fixed counter, it brings extra overhead through extra MSR writes, and no related overcounting issue has been reported so far. Fixes: 3a632cb229bf ("perf/x86/intel: Add simple Haswell PMU support") Reported-by: Li Huafei <lihuafei1@huawei.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240819183004.3132920-1-kan.liang@linux.intel.com Closes: https://lore.kernel.org/lkml/20240729223328.327835-1-lihuafei1@huawei.com/
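The enforced minimums could look roughly like this (a sketch; the event-selector comparison for INST_RETIRED.ALL is simplified):

    static void hsw_limit_period(struct perf_event *event, s64 *left)
    {
            /* Never arm a counter with a period below 32 (HSW143-style overcount) */
            *left = max(*left, 32LL);

            /* HSW11: INST_RETIRED.ALL (event 0xc0) needs a larger minimum period */
            if ((event->hw.config & INTEL_ARCH_EVENT_MASK) == 0xc0)
                    *left = max(*left, 128LL);
    }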
2024-08-25  x86/fred: Set SS to __KERNEL_DS when enabling FRED  (Xin Li (Intel))
SS is initialized to NULL during boot time and not explicitly set to __KERNEL_DS. With FRED enabled, if a kernel event is delivered before a CPU goes to user level for the first time, its SS is NULL thus NULL is pushed into the SS field of the FRED stack frame. But before ERETS is executed, the CPU may context switch to another task and go to user level. Then when the CPU comes back to kernel mode, SS is changed to __KERNEL_DS. Later when ERETS is executed to return from the kernel event handler, a #GP fault is generated because SS doesn't match the SS saved in the FRED stack frame. Initialize SS to __KERNEL_DS when enabling FRED to prevent that. Note, IRET doesn't check if SS matches the SS saved in its stack frame, thus IDT doesn't have this problem. For IDT it doesn't matter whether SS is set to __KERNEL_DS or not, because it's set to NULL upon interrupt or exception delivery and __KERNEL_DS upon SYSCALL. Thus it's pointless to initialize SS for IDT. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240816104316.2276968-1-xin@zytor.com
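In the FRED enabling path, the fix amounts to something like the following sketch:

    /* Make sure SS is never NULL in kernel mode once FRED is enabled */
    loadsegment(ss, __KERNEL_DS);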
2024-08-25  x86/amd_nb: Add new PCI IDs for AMD family 1Ah model 60h-70h  (Richard Gong)
Add new PCI IDs for Device 18h and Function 4 to enable the amd_atl driver on those systems. Signed-off-by: Richard Gong <richard.gong@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/all/20240819123041.915734-1-richard.gong@amd.com
2024-08-25  x86/extable: Remove unused declaration fixup_bug()  (Yue Haibing)
Commit 15a416e8aaa7 ("x86/entry: Treat BUG/WARN as NMI-like entries") removed the implementation but left the declaration. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240816102219.883297-1-yuehaibing@huawei.com
2024-08-25  x86/boot/64: Strip percpu address space when setting up GDT descriptors  (Uros Bizjak)
init_per_cpu_var() returns a pointer in the percpu address space, while rip_rel_ptr() expects a pointer in the generic address space. When strict address space checks are enabled, GCC's named address space checks fail:

    asm.h:124:63: error: passing argument 1 of 'rip_rel_ptr' from pointer to non-enclosed address space

Add an explicit cast to strip the address space from the returned pointer. Fixes: 11e36b0f7c21 ("x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[]") Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240819083334.148536-1-ubizjak@gmail.com
2024-08-25  x86/cpu: Clarify the error message when BIOS does not support SGX  (WangYuli)
When SGX is not supported by the BIOS, the kernel log contains the error 'SGX disabled by BIOS', which can be confusing since there might not be an SGX-related option in the BIOS settings. For the kernel it's difficult to distinguish between the BIOS not supporting SGX and the BIOS supporting SGX but having it disabled. Therefore, update the error message to 'SGX disabled or unsupported by BIOS' to make it easier for those reading kernel logs to understand what's happening. Reported-by: Bo Wu <wubo@uniontech.com> Co-developed-by: Zelong Xiang <xiangzelong@uniontech.com> Signed-off-by: Zelong Xiang <xiangzelong@uniontech.com> Signed-off-by: WangYuli <wangyuli@uniontech.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/all/F8D977CB368423F3+20240825104653.1294624-1-wangyuli@uniontech.com Closes: https://github.com/linuxdeepin/developer-center/issues/10032
2024-08-25  x86/kexec: Add comments around swap_pages() assembly to improve readability  (Kai Huang)
The current assembly around swap_pages() in relocate_kernel() takes some time to follow, because it is easy to lose track of which register holds what as the assembly goes on. Add a couple of comments to clarify the code around swap_pages() and improve readability. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/8b52b0b8513a34b2a02fb4abb05c6700c2821475.1724573384.git.kai.huang@intel.com
2024-08-25  x86/kexec: Fix a comment of swap_pages() assembly  (Kai Huang)
When relocate_kernel() gets called, %rdi holds 'indirection_page' and %rsi holds 'page_list', and %rdi still holds 'indirection_page' when swap_pages() is called. Therefore the comment on the first line of swap_pages():

    movq %rdi, %rcx /* Put the page_list in %rcx */

isn't correct, because the instruction actually moves the 'indirection_page' into %rcx. Fix it. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/adafdfb1421c88efce04420fc9a996c0e2ca1b34.1724573384.git.kai.huang@intel.com
2024-08-25  x86/sgx: Fix a W=1 build warning in function comment  (Kai Huang)
Building the SGX code with W=1 generates the below warnings:

    arch/x86/kernel/cpu/sgx/main.c:741: warning: Function parameter or struct member 'low' not described in 'sgx_calc_section_metric'
    arch/x86/kernel/cpu/sgx/main.c:741: warning: Function parameter or struct member 'high' not described in 'sgx_calc_section_metric'
    ...

The function sgx_calc_section_metric() is a simple helper which is only used in sgx/main.c. There's no need to use a kernel-doc style comment for it. Downgrade it to a normal comment. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240825080649.145250-1-kai.huang@intel.com
2024-08-25  x86/EISA: Use memremap() to probe for the EISA BIOS signature  (Maciej W. Rozycki)
The area at the 0x0FFFD9 physical location in the PC memory space is regular memory, traditionally ROM BIOS and more recently a copy of BIOS code and data in RAM, write-protected. Therefore use memremap() to get access to it rather than ioremap(), avoiding issues in virtualization scenarios and complementing changes such as commit f7750a795687 ("x86, mpparse, x86/acpi, x86/PCI, x86/dmi, SFI: Use memremap() for RAM mappings") or commit 5997efb96756 ("x86/boot: Use memremap() to map the MPF and MPC data"). Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/alpine.DEB.2.21.2408242025210.30766@angie.orcam.me.uk Closes: https://lore.kernel.org/r/20240822095122.736522-1-kirill.shutemov@linux.intel.com
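A sketch of the probe with memremap(); the signature comparison is shown with memcmp() for clarity and the error handling is simplified:

    void *p = memremap(0x0FFFD9, 4, MEMREMAP_WB);   /* regular RAM, not MMIO */

    if (p) {
            if (!memcmp(p, "EISA", 4))
                    EISA_bus = 1;
            memunmap(p);
    }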
2024-08-25  x86/mtrr: Remove obsolete declaration for mtrr_bp_restore()  (Gaosheng Cui)
mtrr_bp_restore() has been removed in commit 0b9a6a8bedbf ("x86/mtrr: Add a stop_machine() handler calling only cache_cpu_init()"), but the declaration was left behind. Remove it. Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240824120234.2516830-1-cuigaosheng1@huawei.com
2024-08-25  x86/cpu_entry_area: Annotate percpu_setup_exception_stacks() as __init  (Nathan Chancellor)
After a recent LLVM change that deduces __cold on functions that only call cold code (such as __init functions), there is a section mismatch warning from percpu_setup_exception_stacks(), which got moved to .text.unlikely. as a result of that optimization:

    WARNING: modpost: vmlinux: section mismatch in reference: percpu_setup_exception_stacks+0x3a (section: .text.unlikely.) -> cea_map_percpu_pages (section: .init.text)

Drop the inline keyword, which does not guarantee inlining, and replace it with __init, as percpu_setup_exception_stacks() is only called from __init code, which clears up the warning. Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240822-x86-percpu_setup_exception_stacks-init-v1-1-57c5921b8209@kernel.org
2024-08-24  crypto: x86/sha256 - Add parentheses around macros' single arguments  (Fangrui Song)
The macros FOUR_ROUNDS_AND_SCHED and DO_4ROUNDS rely on an unexpected/undocumented behavior of the GNU assembler, which might change in the future (https://sourceware.org/bugzilla/show_bug.cgi?id=32073).

    M (1) (2) // 1 arg !? Future: 2 args
    M 1 + 2   // 1 arg !? Future: 3 args
    M 1 2     // 2 args

Add parentheses around the single arguments to support future GNU assembler and LLVM integrated assembler (when the IsOperator hack from the following link is dropped). Link: https://github.com/llvm/llvm-project/commit/055006475e22014b28a070db1bff41ca15f322f0 Signed-off-by: Fangrui Song <maskray@google.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-08-23  x86/hyperv: use helpers to read control registers in hv_snp_boot_ap()  (Yosry Ahmed)
Use native_read_cr*() helpers to read control registers into vmsa->cr* instead of open-coded assembly. No functional change intended, unless there was a purpose to specifying rax. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Link: https://lore.kernel.org/r/20240805201247.427982-1-yosryahmed@google.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240805201247.427982-1-yosryahmed@google.com>
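With the helpers, the open-coded asm reads can simply become (sketch):

    vmsa->cr0 = native_read_cr0();
    vmsa->cr3 = __native_read_cr3();
    vmsa->cr4 = native_read_cr4();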
2024-08-23  x86/syscall: Avoid memcpy() for ia32 syscall_get_arguments()  (Kees Cook)
Modern (fortified) memcpy() prefers to avoid writing (or reading) beyond the end of the addressed destination (or source) struct member:

    In function ‘fortify_memcpy_chk’,
        inlined from ‘syscall_get_arguments’ at ./arch/x86/include/asm/syscall.h:85:2,
        inlined from ‘populate_seccomp_data’ at kernel/seccomp.c:258:2,
        inlined from ‘__seccomp_filter’ at kernel/seccomp.c:1231:3:
    ./include/linux/fortify-string.h:580:25: error: call to ‘__read_overflow2_field’ declared with attribute warning: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Werror=attribute-warning]
      580 |  __read_overflow2_field(q_size_field, size);
          |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As already done for x86_64 and compat mode, do not use memcpy() to extract syscall arguments from struct pt_regs but rather just perform direct assignments. Binary output differences are negligible; the result actually ends up using less stack space:

    -	sub    $0x84,%esp
    +	sub    $0x6c,%esp

and less text size:

       text    data     bss     dec     hex filename
      10794     252       0   11046    2b26 gcc-32b/kernel/seccomp.o.stock
      10714     252       0   10966    2ad6 gcc-32b/kernel/seccomp.o.after

Closes: https://lore.kernel.org/lkml/9b69fb14-df89-4677-9c82-056ea9e706f5@gmail.com/ Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com> Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Mirsad Todorovac <mtodorovac69@gmail.com> Link: https://lore.kernel.org/all/20240708202202.work.477-kees%40kernel.org
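The ia32 argument extraction then becomes plain assignments following the i386 syscall register convention (sketch):

    static inline void syscall_get_arguments(struct task_struct *task,
                                             struct pt_regs *regs,
                                             unsigned long *args)
    {
            /* i386 syscall arguments: ebx, ecx, edx, esi, edi, ebp */
            args[0] = regs->bx;
            args[1] = regs->cx;
            args[2] = regs->dx;
            args[3] = regs->si;
            args[4] = regs->di;
            args[5] = regs->bp;
    }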
2024-08-22  KVM: SVM: Don't advertise Bus Lock Detect to guest if SVM support is missing  (Ravi Bangoria)
If the host supports Bus Lock Detect, KVM advertises it to guests even when SVM support for it is absent. Additionally, the guest wouldn't be able to use it despite its CPUID bit being set. Fix it by unconditionally clearing the feature bit in KVM's CPU capabilities. Reported-by: Jim Mattson <jmattson@google.com> Closes: https://lore.kernel.org/r/CALMp9eRet6+v8Y1Q-i6mqPm4hUow_kJNhmVHfOV8tMfuSS=tVg@mail.gmail.com Fixes: 76ea438b4afc ("KVM: X86: Expose bus lock debug exception to guest") Cc: stable@vger.kernel.org Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240808062937.1149-4-ravi.bangoria@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
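The fix boils down to clearing the capability when the SVM module sets up its CPU caps (sketch):

    /* KVM's SVM code doesn't virtualize Bus Lock Detect; don't advertise it */
    kvm_cpu_cap_clear(X86_FEATURE_BUS_LOCK_DETECT);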
2024-08-22  KVM: x86: Suppress userspace access failures on unsupported, "emulated" MSRs  (Sean Christopherson)
Extend KVM's suppression of userspace MSR access failures to MSRs that KVM reports as emulated, but are ultimately unsupported, e.g. if the VMX MSRs are emulated by KVM, but are unsupported given the vCPU model. Suggested-by: Weijiang Yang <weijiang.yang@intel.com> Reviewed-by: Weijiang Yang <weijiang.yang@intel.com> Link: https://lore.kernel.org/r/20240802181935.292540-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Suppress failures on userspace access to advertised, unsupported MSRs  (Sean Christopherson)
Extend KVM's suppression of failures due to a userspace access to an unsupported, but advertised as a "to save" MSR to all MSRs, not just those that happen to reach the default case statements in kvm_get_msr_common() and kvm_set_msr_common(). KVM's soon-to-be-established ABI is that if an MSR is advertised to userspace, then userspace is allowed to read the MSR, and write back the value that was read, i.e. why an MSR is unsupported doesn't change KVM's ABI. Practically speaking, this is very nearly a nop, as the only other paths that return KVM_MSR_RET_UNSUPPORTED are {svm,vmx}_get_feature_msr(), and it's unlikely, though not impossible, that userspace is using KVM_GET_MSRS on unsupported MSRs. The primary goal of moving the suppression to common code is to allow returning KVM_MSR_RET_UNSUPPORTED as appropriate throughout KVM, without having to manually handle the "is userspace accessing an advertised" waiver. I.e. this will allow formalizing KVM's ABI without incurring a high maintenance cost. Link: https://lore.kernel.org/r/20240802181935.292540-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Hoist x86.c's global msr_* variables up above kvm_do_msr_access()  (Sean Christopherson)
Move the definitions of the various MSR arrays above kvm_do_msr_access() so that kvm_do_msr_access() can query the arrays when handling failures, e.g. to squash errors if userspace tries to read an MSR that isn't fully supported, but that KVM advertised as being an MSR-to-save. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Funnel all fancy MSR return value handling into a common helper  (Sean Christopherson)
Add a common helper, kvm_do_msr_access(), to invoke the "leaf" APIs that are type and access specific, and more importantly to handle errors that are returned from the leaf APIs. I.e. turn kvm_msr_ignored_check() from a helper that is called on an error into a trampoline that detects errors *and* applies relevant side effects, e.g. logging unimplemented accesses. Because the leaf APIs are used for guest accesses, userspace accesses, and KVM accesses, and because KVM supports restricting access to MSRs from userspace via filters, the error handling is subtly non-trivial. E.g. KVM has had at least one bug escape due to making each "outer" function handle errors. See commit 3376ca3f1a20 ("KVM: x86: Fix KVM_GET_MSRS stack info leak"). Link: https://lore.kernel.org/r/20240802181935.292540-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Refactor kvm_get_feature_msr() to avoid struct kvm_msr_entry  (Sean Christopherson)
Refactor kvm_get_feature_msr() to take the components of kvm_msr_entry as separate parameters, along with a vCPU pointer, i.e. to give it the same prototype as kvm_{g,s}et_msr_ignored_check(). This will allow using a common inner helper for handling accesses to "regular" and feature MSRs. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Rename get_msr_feature() APIs to get_feature_msr()  (Sean Christopherson)
Rename all APIs related to feature MSRs from get_msr_feature() to get_feature_msr(). The APIs get "feature MSRs", not "MSR features". And unlike kvm_{g,s}et_msr_common(), the "feature" adjective doesn't describe the helper itself. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Refactor kvm_x86_ops.get_msr_feature() to avoid kvm_msr_entry  (Sean Christopherson)
Refactor get_msr_feature() to take the index and data pointer as distinct parameters in anticipation of eliminating "struct kvm_msr_entry" usage further up the primary callchain. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Rename KVM_MSR_RET_INVALID to KVM_MSR_RET_UNSUPPORTED  (Sean Christopherson)
Rename the "INVALID" internal MSR error return code to "UNSUPPORTED" to try and make it more clear that access was denied because the MSR itself is unsupported/unknown. "INVALID" is too ambiguous, as it could just as easily mean the value for WRMSR as invalid. Avoid UNKNOWN and UNIMPLEMENTED, as the error code is used for MSRs that _are_ actually implemented by KVM, e.g. if the MSR is unsupported because an associated feature flag is not present in guest CPUID. Opportunistically beef up the comments for the internal MSR error codes. Link: https://lore.kernel.org/r/20240802181935.292540-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Move MSR_TYPE_{R,W,RW} values from VMX to x86, as enums  (Sean Christopherson)
Move VMX's MSR_TYPE_{R,W,RW} #defines to x86.h, as enums, so that they can be used by common x86 code, e.g. instead of doing "bool write". Opportunistically tweak the definitions to make it more obvious that the values are bitmasks, not arbitrary ascending values. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
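After the move, the access types read as explicit bitmask enums, roughly:

    enum kvm_msr_access {
            MSR_TYPE_R  = BIT(0),
            MSR_TYPE_W  = BIT(1),
            MSR_TYPE_RW = MSR_TYPE_R | MSR_TYPE_W,
    };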
2024-08-22  KVM: SVM: Disallow guest from changing userspace's MSR_AMD64_DE_CFG value  (Sean Christopherson)
Inject a #GP if the guest attempts to change MSR_AMD64_DE_CFG from its *current* value, not if the guest attempts to write a value other than KVM's set of supported bits. As per the comment and the changelog of the original code, the intent is to effectively make MSR_AMD64_DE_CFG read- only for the guest. Opportunistically use a more conventional equality check instead of an exclusive-OR check to detect attempts to change bits. Fixes: d1d93fa90f1a ("KVM: SVM: Add MSR-based feature support for serializing LFENCE") Cc: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240802181935.292540-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86/mmu: Clean up function comments for dirty logging APIs  (Sean Christopherson)
Rework the function comment for kvm_arch_mmu_enable_log_dirty_pt_masked() into the body of the function, as it has gotten a bit stale, is harder to read without the code context, and is the last source of warnings for W=1 builds in KVM x86 due to using a kernel-doc comment without documenting all parameters. Opportunistically subsume the function comments for kvm_mmu_write_protect_pt_masked() and kvm_mmu_clear_dirty_pt_masked(), as there is no value in regurgitating similar information at a higher level, and capturing the differences between write-protection and PML-based dirty logging is best done in a common location. No functional change intended. Cc: David Matlack <dmatlack@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20240802202006.340854-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Use this_cpu_ptr() in kvm_user_return_msr_cpu_online  (Li Chen)
Use this_cpu_ptr() instead of open coding the equivalent in kvm_user_return_msr_cpu_online. Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Link: https://lore.kernel.org/r/87zfp96ojk.wl-me@linux.beauty Signed-off-by: Sean Christopherson <seanjc@google.com>
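In other words (sketch; the local variable name and the prior open-coded form are illustrative):

    struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
    /* was roughly: per_cpu_ptr(user_return_msrs, smp_processor_id()) */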
2024-08-22  KVM: VMX: hyper-v: Prevent impossible NULL pointer dereference in evmcs_load()  (Vitaly Kuznetsov)
GCC 12.3.0 complains about a potential NULL pointer dereference in evmcs_load() as hv_get_vp_assist_page() can return NULL. In fact, this cannot happen because KVM verifies (hv_init_evmcs()) that every CPU has a valid VP assist page and aborts enabling the feature otherwise. CPU onlining path is also checked in vmx_hardware_enable(). To make the compiler happy and to future proof the code, add a KVM_BUG_ON() sentinel. It doesn't seem to be possible (and logical) to observe evmcs_load() happening without an active vCPU so it is presumed that kvm_get_running_vcpu() can't return NULL. No functional change intended. Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20240816130124.286226-1-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
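The sentinel could look like this (a sketch of the shape of the check):

    struct hv_vp_assist_page *vp_ap =
            hv_get_vp_assist_page(smp_processor_id());

    if (KVM_BUG_ON(!vp_ap, kvm_get_running_vcpu()->kvm))
            return;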
2024-08-22  KVM: nVMX: Use vmx_segment_cache_clear() instead of open coded equivalent  (Maxim Levitsky)
In prepare_vmcs02_rare(), call vmx_segment_cache_clear() instead of setting segment_cache.bitmask directly. Using the helper minimizes the chances of prepare_vmcs02_rare() doing the wrong thing in the future, e.g. if KVM ends up doing more than just zero the bitmask when purging the cache. No functional change intended. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240725175232.337266-2-mlevitsk@redhat.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: nVMX: Honor userspace MSR filter lists for nested VM-Enter/VM-Exit  (Sean Christopherson)
Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has disallowed access to via an MSR filter. Intel already disallows including a handful of "special" MSRs in the VMCS lists, so denying access isn't completely without precedent. More importantly, the behavior is well-defined _and_ can be communicated to the end user, e.g. to the customer that owns a VM running as L1 on top of KVM. On the other hand, ignoring userspace MSR filters is all but guaranteed to result in unexpected behavior as the access will hit KVM's internal state, which is likely not up-to-date. Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS fields, the MSRs in the VMCS load/store lists are 100% guest controlled, thus making it all but impossible to reason about the correctness of ignoring the MSR filter. And if userspace *really* wants to deny access to MSRs via the aforementioned scenarios, userspace can hide the associated feature from the guest, e.g. by disabling the PMU to prevent accessing PERF_GLOBAL_CTRL via its VMCS field. But for the MSR lists, KVM is blindly processing MSRs; the MSR filters are the _only_ way for userspace to deny access. This partially reverts commit ac8d6cad3c7b ("KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr"). Cc: Hou Wenlong <houwenlong.hwl@antgroup.com> Cc: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20240722235922.3351122-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: VMX: Do not account for temporary memory allocation in ECREATE emulation  (Kai Huang)
In handle_encls_ecreate(), a page is allocated to store a copy of SECS structure used by the ENCLS[ECREATE] leaf from the guest. This page is only used temporarily and is freed after use in handle_encls_ecreate(). Don't account for the memory allocation of this page per [1]. Link: https://lore.kernel.org/kvm/b999afeb588eb75d990891855bc6d58861968f23.camel@intel.com/T/#mb81987afc3ab308bbb5861681aa9a20f2aece7fd [1] Signed-off-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240715101224.90958-1-kai.huang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: VMX: Modify the BUILD_BUG_ON_MSG of the 32-bit field in the vmcs_check16 function  (Qiang Liu)
According to the SDM, the meaning of field bit 0 is: Access type (0 = full; 1 = high); must be full for 16-bit, 32-bit, and natural-width fields. So there is no 32-bit high field here, it should be a 32-bit field instead. Signed-off-by: Qiang Liu <liuq131@chinatelecom.cn> Link: https://lore.kernel.org/r/20240702064609.52487-1-liuq131@chinatelecom.cn Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: SVM: Remove unnecessary GFP_KERNEL_ACCOUNT in svm_set_nested_state()  (Yongqiang Liu)
The fixed-size temporary variables vmcb_control_area and vmcb_save_area allocated in svm_set_nested_state() are released when the function exits. Moreover, svm_set_nested_state() holds the vcpu mutex, which prevents massive concurrent allocations, so GFP_KERNEL_ACCOUNT isn't needed. Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com> Link: https://lore.kernel.org/r/20240821112737.3649937-1-liuyongqiang13@huawei.com Signed-off-by: Sean Christopherson <seanjc@google.com>
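I.e. the temporary buffers can be plain GFP_KERNEL allocations (sketch; the local names ctl/save are assumed for illustration):

    ctl  = kzalloc(sizeof(*ctl),  GFP_KERNEL);
    save = kzalloc(sizeof(*save), GFP_KERNEL);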
2024-08-22  KVM: nVMX: Use macros and #defines in vmx_restore_vmx_misc()  (Xin Li)
Use macros in vmx_restore_vmx_misc() instead of open coding everything using BIT_ULL() and GENMASK_ULL(). Opportunistically split feature bits and reserved bits into separate variables, and add a comment explaining the subset logic (it's not immediately obvious that the set of feature bits is NOT the set of _supported_ feature bits). Cc: Shan Kang <shan.kang@intel.com> Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Xin Li <xin3.li@intel.com> [sean: split to separate patch, write changelog, drop #defines] Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20240605231918.2915961-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: VMX: Open code VMX preemption timer rate mask in its accessor  (Xin Li)
Use vmx_misc_preemption_timer_rate() to get the rate in hardware_setup(), and open code the rate's bitmask in vmx_misc_preemption_timer_rate() so that the function looks like all the helpers that grab values from VMX_BASIC and VMX_MISC MSR values. No functional change intended. Cc: Shan Kang <shan.kang@intel.com> Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Xin Li <xin3.li@intel.com> [sean: split to separate patch, write changelog] Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20240605231918.2915961-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
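The accessor then mirrors the other VMX_BASIC/VMX_MISC helpers, roughly:

    static inline u32 vmx_misc_preemption_timer_rate(u64 vmx_misc)
    {
            /* bits 4:0 of IA32_VMX_MISC: TSC-to-preemption-timer rate */
            return vmx_misc & GENMASK_ULL(4, 0);
    }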