summaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)Author
2024-12-17x86/cpu: Introduce new microcode matching helperDave Hansen
The 'x86_cpu_id' and 'x86_cpu_desc' structures are very similar and need to be consolidated. There is a microcode version matching function for 'x86_cpu_desc' but not 'x86_cpu_id'. Create one for 'x86_cpu_id'. This essentially just leverages the x86_cpu_id->driver_data field to replace the less generic x86_cpu_desc->x86_microcode_rev field. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20241213185128.8F24EEFC%40davehans-spike.ostc.intel.com
2024-12-17x86/xen: remove hypercall pageJuergen Gross
The hypercall page is no longer needed. It can be removed, as from the Xen perspective it is optional. But, from Linux's perspective, it removes naked RET instructions that escape the speculative protections that Call Depth Tracking and/or Untrain Ret are trying to achieve. This is part of XSA-466 / CVE-2024-53241. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
2024-12-14x86/sev: Require the RMPREAD instruction after Zen4Tom Lendacky
Limit usage of the non-architectural RMP format to Zen3/Zen4 processors. The RMPREAD instruction, with architectural defined output, is available and should be used for RMP access beyond Zen4. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikunj A Dadhania <nikunj@amd.com> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Reviewed-by: Ashish Kalra <ashish.kalra@amd.com> Link: https://lore.kernel.org/r/5be0093e091778a151266ea853352f62f838eb99.1733172653.git.thomas.lendacky@amd.com
2024-12-13x86/static-call: provide a way to do very early static-call updatesJuergen Gross
Add static_call_update_early() for updating static-call targets in very early boot. This will be needed for support of Xen guest type specific hypercall functions. This is part of XSA-466 / CVE-2024-53241. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Co-developed-by: Peter Zijlstra <peterz@infradead.org> Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
2024-12-13x86: make get_cpu_vendor() accessible from Xen codeJuergen Gross
In order to be able to differentiate between AMD and Intel based systems for very early hypercalls without having to rely on the Xen hypercall page, make get_cpu_vendor() non-static. Refactor early_cpu_init() for the same reason by splitting out the loop initializing cpu_devs() into an externally callable function. This is part of XSA-466 / CVE-2024-53241. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-12-12x86/resctrl: Add write option to "mba_MBps_event" fileTony Luck
The "mba_MBps" mount option provides an alternate method to control memory bandwidth. Instead of specifying allowable bandwidth as a percentage of maximum possible, the user provides a MiB/s limit value. There is a file in each CTRL_MON group directory that shows the event currently in use. Allow writing that file to choose a different event. A user can choose any of the memory bandwidth monitoring events listed in /sys/fs/resctrl/info/L3_mon/mon_features independently for each CTRL_MON group by writing to each of the "mba_MBps_event" files. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20241206163148.83828-8-tony.luck@intel.com
2024-12-12x86/resctrl: Add "mba_MBps_event" file to CTRL_MON directoriesTony Luck
The "mba_MBps" mount option provides an alternate method to control memory bandwidth. Instead of specifying allowable bandwidth as a percentage of maximum possible, the user provides a MiB/s limit value. In preparation to allow the user to pick the memory bandwidth monitoring event used as input to the feedback loop, provide a file in each CTRL_MON group directory that shows the event currently in use. Note that this file is only visible when the "mba_MBps" mount option is in use. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20241206163148.83828-7-tony.luck@intel.com
2024-12-10x86/cpu: Fix typo in x86_match_cpu()'s docRaag Jadav
Fix typo in x86_match_cpu()'s description. [ bp: Massage commit message. ] Signed-off-by: Raag Jadav <raag.jadav@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241030065804.407793-1-raag.jadav@intel.com
2024-12-10Merge branch 'linus' into x86/cleanups, to resolve conflictIngo Molnar
These two commits interact: upstream: 73da582a476e ("x86/cpu/topology: Remove limit of CPUs due to disabled IO/APIC") x86/cleanups: 13148e22c151 ("x86/apic: Remove "disablelapic" cmdline option") Resolve it. Conflicts: arch/x86/kernel/cpu/topology.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-12-10x86/apic: Remove "disablelapic" cmdline optionBorislav Petkov (AMD)
The convention is "no<something>" and there already is "nolapic". Drop the disable one. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Link: https://lore.kernel.org/r/20241202190011.11979-2-bp@kernel.org
2024-12-10Documentation: Merge x86-specific boot options doc into kernel-parameters.txtBorislav Petkov (AMD)
Documentation/arch/x86/x86_64/boot-options.rst is causing unnecessary confusion by being a second place where one can put x86 boot options. Move them into the main one. Drop removed ones like "acpi=ht", while at it. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Link: https://lore.kernel.org/r/20241202190011.11979-1-bp@kernel.org
2024-12-10x86/resctrl: Make mba_sc use total bandwidth if local is not supportedTony Luck
The default input measurement to the mba_sc feedback loop for memory bandwidth control when the user mounts with the "mba_MBps" option is the local bandwidth event. But some systems may not support a local bandwidth event. When local bandwidth event is not supported, check for support of total bandwidth and use that instead. Relax the mount option check to allow use of the "mba_MBps" option for systems when only total bandwidth monitoring is supported. Also update the error message. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20241206163148.83828-6-tony.luck@intel.com
2024-12-10x86/resctrl: Compute memory bandwidth for all supported eventsTony Luck
Switching between local and total memory bandwidth events as the input to the mba_sc feedback loop would be cumbersome and take effect slowly in the current implementation as the bandwidth is only known after two consecutive readings of the same event. Compute the bandwidth for all supported events. This doesn't add significant overhead and will make changing which event is used simple. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20241206163148.83828-5-tony.luck@intel.com
2024-12-10x86/boot/64: Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on ↵Ard Biesheuvel
GCC-12 In __startup_64(), the bool 'la57' can only assume the 'true' value if CONFIG_X86_5LEVEL is enabled in the build, and generally, the compiler can make this inference at build time, and elide any references to the symbol 'level4_kernel_pgt', which may be undefined if 'la57' is false. As it turns out, GCC 12 gets this wrong sometimes, and gives up with a build error: ld: arch/x86/kernel/head64.o: in function `__startup_64': head64.c:(.head.text+0xbd): undefined reference to `level4_kernel_pgt' even though the reference is in unreachable code. Fix this by duplicating the IS_ENABLED(CONFIG_X86_5LEVEL) in the conditional that tests the value of 'la57'. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20241209094105.762857-2-ardb+git@google.com Closes: https://lore.kernel.org/oe-kbuild-all/202412060403.efD8Kgb7-lkp@intel.com/
2024-12-10x86/resctrl: Modify update_mba_bw() to use per CTRL_MON group eventTony Luck
update_mba_bw() hard codes use of the memory bandwidth local event which prevents more flexible options from being deployed. Change this function to use the event specified in the rdtgroup that is being processed. Mount time checks for the "mba_MBps" option ensure that local memory bandwidth is enabled. So drop the redundant is_mbm_local_enabled() check. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20241206163148.83828-4-tony.luck@intel.com
2024-12-10x86/resctrl: Prepare for per-CTRL_MON group mba_MBps controlTony Luck
Resctrl uses local memory bandwidth event as input to the feedback loop when the mba_MBps mount option is used. This means that this mount option cannot be used on systems that only support monitoring of total bandwidth. Prepare to allow users to choose the input event independently for each CTRL_MON group by adding a global variable "mba_mbps_default_event" used to set the default event for each CTRL_MON group, and a new field "mba_mbps_event" in struct rdtgroup to track which event is used for each CTRL_MON group. Notes: 1) Both of these are only used when the user mounts the filesystem with the "mba_MBps" option. 2) Only check for support of local bandwidth event when initializing mba_mbps_default_event. Support for total bandwidth event can be added after other routines in resctrl have been updated to handle total bandwidth event. [ bp: Move mba_mbps_default_event extern into the arch header. ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20241206163148.83828-3-tony.luck@intel.com
2024-12-09x86/resctrl: Introduce resctrl_file_fflags_init() to initialize fflagsBabu Moger
thread_throttle_mode_init() and mbm_config_rftype_init() both initialize fflags for resctrl files. Adding new files will involve adding another function to initialize the fflags. This can be simplified by adding a new function resctrl_file_fflags_init() and passing the file name and flags to be initialized. Consolidate fflags initialization into resctrl_file_fflags_init() and remove thread_throttle_mode_init() and mbm_config_rftype_init(). [ Tony: Drop __init attribute so resctrl_file_fflags_init() can be used at run time. ] Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20241206163148.83828-2-tony.luck@intel.com
2024-12-09x86/resctrl: Use kthread_run_on_cpu()Frederic Weisbecker
Use the proper API instead of open coding it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20240807160228.26206-3-frederic@kernel.org
2024-12-09x86/hyperv: Fix hv tsc page based sched_clock for hibernationNaman Jain
read_hv_sched_clock_tsc() assumes that the Hyper-V clock counter is bigger than the variable hv_sched_clock_offset, which is cached during early boot, but depending on the timing this assumption may be false when a hibernated VM starts again (the clock counter starts from 0 again) and is resuming back (Note: hv_init_tsc_clocksource() is not called during hibernation/resume); consequently, read_hv_sched_clock_tsc() may return a negative integer (which is interpreted as a huge positive integer since the return type is u64) and new kernel messages are prefixed with huge timestamps before read_hv_sched_clock_tsc() grows big enough (which typically takes several seconds). Fix the issue by saving the Hyper-V clock counter just before the suspend, and using it to correct the hv_sched_clock_offset in resume. This makes hv tsc page based sched_clock continuous and ensures that post resume, it starts from where it left off during suspend. Override x86_platform.save_sched_clock_state and x86_platform.restore_sched_clock_state routines to correct this as soon as possible. Note: if Invariant TSC is available, the issue doesn't happen because 1) we don't register read_hv_sched_clock_tsc() for sched clock: See commit e5313f1c5404 ("clocksource/drivers/hyper-v: Rework clocksource and sched clock setup"); 2) the common x86 code adjusts TSC similarly: see __restore_processor_state() -> tsc_verify_tsc_adjust(true) and x86_platform.restore_sched_clock_state(). Cc: stable@vger.kernel.org Fixes: 1349401ff1aa ("clocksource/drivers/hyper-v: Suspend/resume Hyper-V clocksource for hibernation") Co-developed-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Naman Jain <namjain@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/20240917053917.76787-1-namjain@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240917053917.76787-1-namjain@linux.microsoft.com>
2024-12-09x86: Fix build regression with CONFIG_KEXEC_JUMP enabledDamien Le Moal
Build 6.13-rc12 for x86_64 with gcc 14.2.1 fails with the error: ld: vmlinux.o: in function `virtual_mapped': linux/arch/x86/kernel/relocate_kernel_64.S:249:(.text+0x5915b): undefined reference to `saved_context_gdt_desc' when CONFIG_KEXEC_JUMP is enabled. This was introduced by commit 07fa619f2a40 ("x86/kexec: Restore GDT on return from ::preserve_context kexec") which introduced a use of saved_context_gdt_desc without a declaration for it. Fix that by including asm/asm-offsets.h where saved_context_gdt_desc is defined (indirectly in include/generated/asm-offsets.h which asm/asm-offsets.h includes). Fixes: 07fa619f2a40 ("x86/kexec: Restore GDT on return from ::preserve_context kexec") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: David Woodhouse <dwmw@amazon.co.uk> Closes: https://lore.kernel.org/oe-kbuild-all/202411270006.ZyyzpYf8-lkp@intel.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-06x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()Kirill A. Shutemov
Rename the helper to better reflect its function. Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@intel.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/20241202073139.448208-1-kirill.shutemov@linux.intel.com
2024-12-06x86/CPU/AMD: WARN when setting EFER.AUTOIBRS if and only if the WRMSR failsSean Christopherson
When ensuring EFER.AUTOIBRS is set, WARN only on a negative return code from msr_set_bit(), as '1' is used to indicate the WRMSR was successful ('0' indicates the MSR bit was already set). Fixes: 8cc68c9c9e92 ("x86/CPU/AMD: Make sure EFER[AIBRSE] is set") Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/Z1MkNofJjt7Oq0G6@google.com Closes: https://lore.kernel.org/all/20241205220604.GA2054199@thelio-3990X
2024-12-06x86/cacheinfo: Delete global num_cache_leavesRicardo Neri
Linux remembers cpu_cachinfo::num_leaves per CPU, but x86 initializes all CPUs from the same global "num_cache_leaves". This is erroneous on systems such as Meteor Lake, where each CPU has a distinct num_leaves value. Delete the global "num_cache_leaves" and initialize num_leaves on each CPU. init_cache_level() no longer needs to set num_leaves. Also, it never had to set num_levels as it is unnecessary in x86. Keep checking for zero cache leaves. Such condition indicates a bug. [ bp: Cleanup. ] Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org # 6.3+ Link: https://lore.kernel.org/r/20241128002247.26726-3-ricardo.neri-calderon@linux.intel.com
2024-12-06x86/sysfs: Constify 'struct bin_attribute'Thomas Weißschuh
The sysfs core now allows instances of 'struct bin_attribute' to be moved into read-only memory. Make use of that to protect them against accidental or malicious modifications. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20241202-sysfs-const-bin_attr-x86-v1-1-b767d5f0ac5c@weissschuh.net
2024-12-06x86/paravirt: Remove the WBINVD callbackJuergen Gross
The pv_ops::cpu.wbinvd paravirt callback is a leftover of lguest times. Today it is no longer needed, as all users use the native WBINVD implementation. Remove the callback and rename native_wbinvd() to wbinvd(). Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20241203071550.26487-1-jgross@suse.com
2024-12-06x86/cpufeatures: Free up unused feature bitsSohil Mehta
Linux defined feature bits X86_FEATURE_P3 and X86_FEATURE_P4 are not used anywhere. Commit f31d731e4467 ("x86: use X86_FEATURE_NOPL in alternatives") got rid of the last usage in 2008. Remove the related mappings and code. Just like all X86_FEATURE bits, the raw bit numbers can be exposed to userspace via MODULE_DEVICE_TABLE(). There is a very small theoretical chance of userspace getting confused if these bits got reassigned and changed logical meaning. But these bits were never used for a device table, so it's highly unlikely this will ever happen in practice. [ dhansen: clarify userspace visibility of these bits ] Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/all/20241107233000.2742619-1-sohil.mehta%40intel.com
2024-12-06x86/kexec: Mark relocate_kernel page as ROX instead of RWXDavid Woodhouse
All writes to the page now happen before it gets marked as executable (or after it's already switched to the identmap page tables where it's OK to be RWX). Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20241205153343.3275139-14-dwmw2@infradead.org
2024-12-06x86/kexec: Clean up register usage in relocate_kernel()David Woodhouse
The memory encryption flag is passed in %r8 because that's where the calling convention puts it. Instead of moving it to %r12 and then using %r8 for other things, just leave it in %r8 and use other registers instead. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205153343.3275139-13-dwmw2@infradead.org
2024-12-06x86/kexec: Eliminate writes through kernel mapping of relocate_kernel pageDavid Woodhouse
All writes to the relocate_kernel control page are now done *after* the %cr3 switch via simple %rip-relative addressing, which means the DATA() macro with its pointer arithmetic can also now be removed. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205153343.3275139-12-dwmw2@infradead.org
2024-12-06x86/kexec: Drop page_list argument from relocate_kernel()David Woodhouse
The kernel's virtual mapping of the relocate_kernel page currently needs to be RWX because it is written to before the %cr3 switch. Now that the relocate_kernel page has its own .data section and local variables, it can also have *global* variables. So eliminate the separate page_list argument, and write the same information directly to variables in the relocate_kernel page instead. This way, the relocate_kernel code itself doesn't need to copy it. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205153343.3275139-11-dwmw2@infradead.org
2024-12-06x86/kexec: Add data section to relocate_kernelDavid Woodhouse
Now that the relocate_kernel page is handled sanely by a linker script we can have actual data, and just use %rip-relative addressing to access it. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205153343.3275139-10-dwmw2@infradead.org
2024-12-06x86/kexec: Move relocate_kernel to kernel .data sectionDavid Woodhouse
Now that the copy is executed instead of the original, the relocate_kernel page can live in the kernel's .text section. This will allow subsequent commits to actually add real data to it and clean up the code somewhat as well as making the control page ROX. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205153343.3275139-9-dwmw2@infradead.org
2024-12-06x86/kexec: Invoke copy of relocate_kernel() instead of the originalDavid Woodhouse
This currently calls set_memory_x() from machine_kexec_prepare() just like the 32-bit version does. That's actually a bit earlier than I'd like, as it leaves the page RWX all the time the image is even *loaded*. Subsequent commits will eliminate all the writes to the page between the point it's marked executable in machine_kexec_prepare() the time that relocate_kernel() is running and has switched to the identmap %cr3, so that it can be ROX. But that can't happen until it's moved to the .data section of the kernel, and *that* can't happen until we start executing the copy instead of executing it in place in the kernel .text. So break the circular dependency in those commits by letting it be RWX for now. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205153343.3275139-8-dwmw2@infradead.org
2024-12-06x86/kexec: Copy control page into place in machine_kexec_prepare()David Woodhouse
There's no need for this to wait until the actual machine_kexec() invocation; future changes will need to make the control page read-only and executable, so all writes should be completed before machine_kexec_prepare() returns. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205153343.3275139-7-dwmw2@infradead.org
2024-12-06x86/kexec: Allocate PGD for x86_64 transition page tables separatelyDavid Woodhouse
Now that the following fix: d0ceea662d45 ("x86/mm: Add _PAGE_NOPTISHADOW bit to avoid updating userspace page tables") stops kernel_ident_mapping_init() from scribbling over the end of a 4KiB PGD by assuming the following 4KiB will be a userspace PGD, there's no good reason for the kexec PGD to be part of a single 8KiB allocation with the control_code_page. ( It's not clear that that was the reason for x86_64 kexec doing it that way in the first place either; there were no comments to that effect and it seems to have been the case even before PTI came along. It looks like it was just a happy accident which prevented memory corruption on kexec. ) Either way, it definitely isn't needed now. Just allocate the PGD separately on x86_64, like i386 already does. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205153343.3275139-6-dwmw2@infradead.org
2024-12-06x86/kexec: Only swap pages for ::preserve_context modeDavid Woodhouse
There's no need to swap pages (which involves three memcopies for each page) in the plain kexec case. Just do a single copy from source to destination page. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205153343.3275139-5-dwmw2@infradead.org
2024-12-06x86/kexec: Use named labels in swap_pages in relocate_kernel_64.SDavid Woodhouse
Make the code a little more readable. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Kai Huang <kai.huang@intel.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205153343.3275139-4-dwmw2@infradead.org
2024-12-06x86/kexec: Clean up and document register use in relocate_kernel_64.SDavid Woodhouse
Add more comments explaining what each register contains, and save the preserve_context flag to a non-clobbered register sooner, to keep things simpler. No change in behavior intended. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Kai Huang <kai.huang@intel.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205153343.3275139-3-dwmw2@infradead.org
2024-12-06Merge branch 'x86/urgent' into x86/boot, to pick up dependent fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-12-06x86/kexec: Restore GDT on return from ::preserve_context kexecDavid Woodhouse
The restore_processor_state() function explicitly states that "the asm code that gets us here will have restored a usable GDT". That wasn't true in the case of returning from a ::preserve_context kexec. Make it so. Without this, the kernel was depending on the called function to reload a GDT which is appropriate for the kernel before returning. Test program: #include <unistd.h> #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <linux/kexec.h> #include <linux/reboot.h> #include <sys/reboot.h> #include <sys/syscall.h> int main (void) { struct kexec_segment segment = {}; unsigned char purgatory[] = { 0x66, 0xba, 0xf8, 0x03, // mov $0x3f8, %dx 0xb0, 0x42, // mov $0x42, %al 0xee, // outb %al, (%dx) 0xc3, // ret }; int ret; segment.buf = &purgatory; segment.bufsz = sizeof(purgatory); segment.mem = (void *)0x400000; segment.memsz = 0x1000; ret = syscall(__NR_kexec_load, 0x400000, 1, &segment, KEXEC_PRESERVE_CONTEXT); if (ret) { perror("kexec_load"); exit(1); } ret = syscall(__NR_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2, LINUX_REBOOT_CMD_KEXEC); if (ret) { perror("kexec reboot"); exit(1); } printf("Success\n"); return 0; } Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20241205153343.3275139-2-dwmw2@infradead.org
2024-12-05x86/cpu/topology: Remove limit of CPUs due to disabled IO/APICFernando Fernandez Mancera
The rework of possible CPUs management erroneously disabled SMP when the IO/APIC is disabled either by the 'noapic' command line parameter or during IO/APIC setup. SMP is possible without IO/APIC. Remove the ioapic_is_disabled conditions from the relevant possible CPU management code paths to restore the orgininal behaviour. Fixes: 7c0edad3643f ("x86/cpu/topology: Rework possible CPU management") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241202145905.1482-1-ffmancera@riseup.net
2024-12-05x86/boot: Move .head.text into its own output sectionArd Biesheuvel
In order to be able to double check that vmlinux is emitted without absolute symbol references in .head.text, it needs to be distinguishable from the rest of .text in the ELF metadata. So move .head.text into its own ELF section. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205112804.3416920-15-ardb+git@google.com
2024-12-05x86/kernel: Move ENTRY_TEXT to the start of the imageArd Biesheuvel
Since commit: 7734a0f31e99 ("x86/boot: Robustify calling startup_{32,64}() from the decompressor code") it is no longer necessary for .head.text to appear at the start of the image. Since ENTRY_TEXT needs to appear PMD-aligned, it is easier to just place it at the start of the image, rather than line it up with the end of the .text section. The amount of padding required should be the same, but this arrangement also permits .head.text to be split off and emitted separately, which is needed by a subsequent change. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205112804.3416920-14-ardb+git@google.com
2024-12-05x86/boot/64: Avoid intentional absolute symbol references in .head.textArd Biesheuvel
The code in .head.text executes from a 1:1 mapping and cannot generally refer to global variables using their kernel virtual addresses. However, there are some occurrences of such references that are valid: the kernel virtual addresses of _text and _end are needed to populate the page tables correctly, and some other section markers are used in a similar way. To avoid the need for making exceptions to the rule that .head.text must not contain any absolute symbol references, derive these addresses from the RIP-relative 1:1 mapped physical addresses, which can be safely determined using RIP_REL_REF(). Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205112804.3416920-12-ardb+git@google.com
2024-12-05x86/boot/64: Determine VA/PA offset before entering C codeArd Biesheuvel
Implicit absolute symbol references (e.g., taking the address of a global variable) must be avoided in the C code that runs from the early 1:1 mapping of the kernel, given that this is a practice that violates assumptions on the part of the toolchain. I.e., RIP-relative and absolute references are expected to produce the same values, and so the compiler is free to choose either. However, the code currently assumes that RIP-relative references are never emitted here. So an explicit virtual-to-physical offset needs to be used instead to derive the kernel virtual addresses of _text and _end, instead of simply taking the addresses and assuming that the compiler will not choose to use a RIP-relative references in this particular case. Currently, phys_base is already used to perform such calculations, but it is derived from the kernel virtual address of _text, which is taken using an implicit absolute symbol reference. So instead, derive this VA-to-PA offset in asm code, and pass it to the C startup code. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205112804.3416920-11-ardb+git@google.com
2024-12-04x86/cpu: Add Lunar Lake to list of CPUs with a broken MONITOR implementationLen Brown
Under some conditions, MONITOR wakeups on Lunar Lake processors can be lost, resulting in significant user-visible delays. Add Lunar Lake to X86_BUG_MONITOR so that wake_up_idle_cpu() always sends an IPI, avoiding this potential delay. Reported originally here: https://bugzilla.kernel.org/show_bug.cgi?id=219364 [ dhansen: tweak subject ] Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/a4aa8842a3c3bfdb7fe9807710eef159cbf0e705.1731463305.git.len.brown%40intel.com
2024-12-04x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()Kirill A. Shutemov
Rename the helper to better reflect its function. Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Link: https://lore.kernel.org/all/20241202073139.448208-1-kirill.shutemov%40linux.intel.com
2024-12-02x86/pkeys: Ensure updated PKRU value is XRSTOR'dAruna Ramakrishna
When XSTATE_BV[i] is 0, and XRSTOR attempts to restore state component 'i' it ignores any value in the XSAVE buffer and instead restores the state component's init value. This means that if XSAVE writes XSTATE_BV[PKRU]=0 then XRSTOR will ignore the value that update_pkru_in_sigframe() writes to the XSAVE buffer. XSTATE_BV[PKRU] only gets written as 0 if PKRU is in its init state. On Intel CPUs, basically never happens because the kernel usually overwrites the init value (aside: this is why we didn't notice this bug until now). But on AMD, the init tracker is more aggressive and will track PKRU as being in its init state upon any wrpkru(0x0). Unfortunately, sig_prepare_pkru() does just that: wrpkru(0x0). This writes XSTATE_BV[PKRU]=0 which makes XRSTOR ignore the PKRU value in the sigframe. To fix this, always overwrite the sigframe XSTATE_BV with a value that has XSTATE_BV[PKRU]==1. This ensures that XRSTOR will not ignore what update_pkru_in_sigframe() wrote. The problematic sequence of events is something like this: Userspace does: * wrpkru(0xffff0000) (or whatever) * Hardware sets: XINUSE[PKRU]=1 Signal happens, kernel is entered: * sig_prepare_pkru() => wrpkru(0x00000000) * Hardware sets: XINUSE[PKRU]=0 (aggressive AMD init tracker) * XSAVE writes most of XSAVE buffer, including XSTATE_BV[PKRU]=XINUSE[PKRU]=0 * update_pkru_in_sigframe() overwrites PKRU in XSAVE buffer ... signal handling * XRSTOR sees XSTATE_BV[PKRU]==0, ignores just-written value from update_pkru_in_sigframe() Fixes: 70044df250d0 ("x86/pkeys: Update PKRU to enable all pkeys before XSAVE") Suggested-by: Rudi Horn <rudi.horn@oracle.com> Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20241119174520.3987538-3-aruna.ramakrishna%40oracle.com
2024-12-02x86/pkeys: Change caller of update_pkru_in_sigframe()Aruna Ramakrishna
update_pkru_in_sigframe() will shortly need some information which is only available inside xsave_to_user_sigframe(). Move update_pkru_in_sigframe() inside the other function to make it easier to provide it that information. No functional changes. Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20241119174520.3987538-2-aruna.ramakrishna%40oracle.com
2024-12-02x86: Convert unreachable() to BUG()Peter Zijlstra
Avoid unreachable() as it can (and will in the absence of UBSAN) generate fallthrough code. Use BUG() so we get a UD2 trap (with unreachable annotation). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20241128094312.028316261@infradead.org