summaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)Author
2024-03-13Merge tag 'pm-6.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "From the functional perspective, the most significant change here is the addition of support for Energy Models that can be updated dynamically at run time. There is also the addition of LZ4 compression support for hibernation, the new preferred core support in amd-pstate, new platforms support in the Intel RAPL driver, new model-specific EPP handling in intel_pstate and more. Apart from that, the cpufreq default transition delay is reduced from 10 ms to 2 ms (along with some related adjustments), the system suspend statistics code undergoes a significant rework and there is a usual bunch of fixes and code cleanups all over. Specifics: - Allow the Energy Model to be updated dynamically (Lukasz Luba) - Add support for LZ4 compression algorithm to the hibernation image creation and loading code (Nikhil V) - Fix and clean up system suspend statistics collection (Rafael Wysocki) - Simplify device suspend and resume handling in the power management core code (Rafael Wysocki) - Fix PCI hibernation support description (Yiwei Lin) - Make hibernation take set_memory_ro() return values into account as appropriate (Christophe Leroy) - Set mem_sleep_current during kernel command line setup to avoid an ordering issue with handling it (Maulik Shah) - Fix wake IRQs handling when pm_runtime_force_suspend() is used as a driver's system suspend callback (Qingliang Li) - Simplify pm_runtime_get_if_active() usage and add a replacement for pm_runtime_put_autosuspend() (Sakari Ailus) - Add a tracepoint for runtime_status changes tracking (Vilas Bhat) - Fix section title markdown in the runtime PM documentation (Yiwei Lin) - Enable preferred core support in the amd-pstate cpufreq driver (Meng Li) - Fix min_perf assignment in amd_pstate_adjust_perf() and make the min/max limit perf values in amd-pstate always stay within the (highest perf, lowest perf) range (Tor Vic, Meng Li) - Allow intel_pstate to assign model-specific values to strings used in the EPP sysfs interface and make it do so on Meteor Lake (Srinivas Pandruvada) - Drop long-unused cpudata::prev_cummulative_iowait from the intel_pstate cpufreq driver (Jiri Slaby) - Prevent scaling_cur_freq from exceeding scaling_max_freq when the latter is an inefficient frequency (Shivnandan Kumar) - Change default transition delay in cpufreq to 2ms (Qais Yousef) - Remove references to 10ms minimum sampling rate from comments in the cpufreq code (Pierre Gondois) - Honour transition_latency over transition_delay_us in cpufreq (Qais Yousef) - Stop unregistering cpufreq cooling on CPU hot-remove (Viresh Kumar) - General enhancements / cleanups to ARM cpufreq drivers (tianyu2, Nícolas F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia Belova) - Update cpufreq-dt-platdev to block/approve devices (Richard Acayan) - Make the SCMI cpufreq driver get a transition delay value from firmware (Pierre Gondois) - Prevent the haltpoll cpuidle governor from shrinking guest poll_limit_ns below grow_start (Parshuram Sangle) - Avoid potential overflow in integer multiplication when computing cpuidle state parameters (C Cheng) - Adjust MWAIT hint target C-state computation in the ACPI cpuidle driver and in intel_idle to return a correct value for C0 (He Rongguang) - Address multiple issues in the TPMI RAPL driver and add support for new platforms (Lunar Lake-M, Arrow Lake) to Intel RAPL (Zhang Rui) - Fix freq_qos_add_request() return value check in dtpm_cpu (Daniel Lezcano) - Fix kernel-doc for dtpm_create_hierarchy() (Yang Li) - Fix file leak in get_pkg_num() in x86_energy_perf_policy (Samasth Norway Ananda) - Fix cpupower-frequency-info.1 man page typo (Jan Kratochvil) - Fix a couple of warnings in the OPP core code related to W=1 builds (Viresh Kumar) - Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h (Viresh Kumar) - Extend dev_pm_opp_data with turbo support (Sibi Sankar) - dt-bindings: drop maxItems from inner items (David Heidelberg)" * tag 'pm-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (95 commits) dt-bindings: opp: drop maxItems from inner items OPP: debugfs: Fix warning around icc_get_name() OPP: debugfs: Fix warning with W=1 builds cpufreq: Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h OPP: Extend dev_pm_opp_data with turbo support Fix cpupower-frequency-info.1 man page typo cpufreq: scmi: Set transition_delay_us firmware: arm_scmi: Populate fast channel rate_limit firmware: arm_scmi: Populate perf commands rate_limit cpuidle: ACPI/intel: fix MWAIT hint target C-state computation PM: sleep: wakeirq: fix wake irq warning in system suspend powercap: dtpm: Fix kernel-doc for dtpm_create_hierarchy() function cpufreq: Don't unregister cpufreq cooling on CPU hotplug PM: suspend: Set mem_sleep_current during kernel command line setup cpufreq: Honour transition_latency over transition_delay_us cpufreq: Limit resolving a frequency to policy min/max Documentation: PM: Fix runtime_pm.rst markdown syntax cpufreq: amd-pstate: adjust min/max limit perf cpufreq: Remove references to 10ms min sampling rate cpufreq: intel_pstate: Update default EPPs for Meteor Lake ...
2024-03-13x86/vmlinux.lds.S: Take __START_KERNEL out conditional definitionWei Yang
If CONFIG_X86_32=y, the section start address is defined to be "LOAD_OFFSET + LOAD_PHYSICAL_ADDR", which is the same as __START_KERNEL_map. Unify it with the 64-bit definition to simplify the code. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240313075839.8321-5-richard.weiyang@gmail.com
2024-03-13x86/vmlinux.lds.S: Remove conditional definition of LOAD_OFFSETWei Yang
In vmlinux.lds.S, we define LOAD_OFFSET conditionally to __PAGE_OFFSET or __START_KERNEL_map. While __START_KERNEL_map is already defined to the same value with the same condition. So it is fine to define LOAD_OFFSET to __START_KERNEL_map directly. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240313075839.8321-4-richard.weiyang@gmail.com
2024-03-12Merge tag 'x86-boot-2024-03-12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot updates from Ingo Molnar: - Continuing work by Ard Biesheuvel to improve the x86 early startup code, with the long-term goal to make it position independent: - Get rid of early accesses to global objects, either by moving them to the stack, deferring the access until later, or dropping the globals entirely - Move all code that runs early via the 1:1 mapping into .head.text, and move code that does not out of it, so that build time checks can be added later to ensure that no inadvertent absolute references were emitted into code that does not tolerate them - Remove fixup_pointer() and occurrences of __pa_symbol(), which rely on the compiler emitting absolute references, which is not guaranteed - Improve the early console code - Add early console message about ignored NMIs, so that users are at least warned about their existence - even if we cannot do anything about them - Improve the kexec code's kernel load address handling - Enable more X86S (simplified x86) bits - Simplify early boot GDT handling - Micro-optimize the boot code a bit - Misc cleanups * tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits) x86/sev: Move early startup code into .head.text section x86/sme: Move early SME kernel encryption handling into .head.text x86/boot: Move mem_encrypt= parsing to the decompressor efi/libstub: Add generic support for parsing mem_encrypt= x86/startup_64: Simplify virtual switch on primary boot x86/startup_64: Simplify calculation of initial page table address x86/startup_64: Defer assignment of 5-level paging global variables x86/startup_64: Simplify CR4 handling in startup code x86/boot: Use 32-bit XOR to clear registers efi/x86: Set the PE/COFF header's NX compat flag unconditionally x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[] x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[] x86/boot/64: Use RIP_REL_REF() to access early page tables x86/boot/64: Use RIP_REL_REF() to access '__supported_pte_mask' x86/boot/64: Use RIP_REL_REF() to access early_dynamic_pgts[] x86/boot/64: Use RIP_REL_REF() to assign 'phys_base' x86/boot/64: Simplify global variable accesses in GDT/IDT programming x86/trampoline: Bypass compat mode in trampoline_start64() if not needed kexec: Allocate kernel above bzImage's pref_address x86/boot: Add a message about ignored early NMIs ...
2024-03-12Merge tag 'rfds-for-linus-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RFDS mitigation from Dave Hansen: "RFDS is a CPU vulnerability that may allow a malicious userspace to infer stale register values from kernel space. Kernel registers can have all kinds of secrets in them so the mitigation is basically to wait until the kernel is about to return to userspace and has user values in the registers. At that point there is little chance of kernel secrets ending up in the registers and the microarchitectural state can be cleared. This leverages some recent robustness fixes for the existing MDS vulnerability. Both MDS and RFDS use the VERW instruction for mitigation" * tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests x86/rfds: Mitigate Register File Data Sampling (RFDS) Documentation/hw-vuln: Add documentation for RFDS x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set
2024-03-12Merge branch 'linus' into x86/boot, to resolve conflictIngo Molnar
There's a new conflict with Linus's upstream tree, because in the following merge conflict resolution in <asm/coco.h>: 38b334fc767e Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Linus has resolved the conflicting placement of 'cc_mask' better than the original commit: 1c811d403afd x86/sev: Fix position dependent variable references in startup code ... which was also done by an internal merge resolution: 2e5fc4786b7a Merge branch 'x86/sev' into x86/boot, to resolve conflicts and to pick up dependent tree But Linus is right in 38b334fc767e, the 'cc_mask' declaration is sufficient within the #ifdef CONFIG_ARCH_HAS_CC_PLATFORM block. So instead of forcing Linus to do the same resolution again, merge in Linus's tree and follow his conflict resolution. Conflicts: arch/x86/include/asm/coco.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-03-12mshyperv: Introduce hv_get_hypervisor_version functionNuno Das Neves
Introduce x86_64 and arm64 functions to get the hypervisor version information and store it in a structure for simpler parsing. Use the new function to get and parse the version at boot time. While at it, move the printing code to hv_common_init() so it is not duplicated. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Acked-by: Wei Liu <wei.liu@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1709852618-29110-1-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1709852618-29110-1-git-send-email-nunodasneves@linux.microsoft.com>
2024-03-11Merge tag 'x86-core-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: - The biggest change is the rework of the percpu code, to support the 'Named Address Spaces' GCC feature, by Uros Bizjak: - This allows C code to access GS and FS segment relative memory via variables declared with such attributes, which allows the compiler to better optimize those accesses than the previous inline assembly code. - The series also includes a number of micro-optimizations for various percpu access methods, plus a number of cleanups of %gs accesses in assembly code. - These changes have been exposed to linux-next testing for the last ~5 months, with no known regressions in this area. - Fix/clean up __switch_to()'s broken but accidentally working handling of FPU switching - which also generates better code - Propagate more RIP-relative addressing in assembly code, to generate slightly better code - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to make it easier for distros to follow & maintain these options - Rework the x86 idle code to cure RCU violations and to clean up the logic - Clean up the vDSO Makefile logic - Misc cleanups and fixes * tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/idle: Select idle routine only once x86/idle: Let prefer_mwait_c1_over_halt() return bool x86/idle: Cleanup idle_setup() x86/idle: Clean up idle selection x86/idle: Sanitize X86_BUG_AMD_E400 handling sched/idle: Conditionally handle tick broadcast in default_idle_call() x86: Increase brk randomness entropy for 64-bit systems x86/vdso: Move vDSO to mmap region x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o x86/retpoline: Ensure default return thunk isn't used at runtime x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32 x86/vdso: Use $(addprefix ) instead of $(foreach ) x86/vdso: Simplify obj-y addition x86/vdso: Consolidate targets and clean-files x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS ...
2024-03-11Merge tag 'x86-cleanups-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Misc cleanups, including a large series from Thomas Gleixner to cure sparse warnings" * tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/nmi: Drop unused declaration of proc_nmi_enabled() x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address() x86/percpu: Cure per CPU madness on UP smp: Consolidate smp_prepare_boot_cpu() x86/msr: Add missing __percpu annotations x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h> perf/x86/amd/uncore: Fix __percpu annotation x86/nmi: Remove an unnecessary IS_ENABLED(CONFIG_SMP) x86/apm_32: Remove dead function apm_get_battery_status() x86/insn-eval: Fix function param name in get_eff_addr_sib()
2024-03-11Merge tag 'x86-build-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 build updates from Ingo Molnar: - Reduce <asm/bootparam.h> dependencies - Simplify <asm/efi.h> - Unify *_setup_data definitions into <asm/setup_data.h> - Reduce the size of <asm/bootparam.h> * tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Do not include <asm/bootparam.h> in several files x86/efi: Implement arch_ima_efi_boot_mode() in source file x86/setup: Move internal setup_data structures into setup_data.h x86/setup: Move UAPI setup structures into setup_data.h
2024-03-11Merge tag 'x86_misc_for_v6.9_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Borislav Petkov: - Fix a wrong check in the function reporting whether a CPU executes (or not) a NMI handler - Ratelimit unknown NMIs messages in order to not potentially slow down the machine - Other fixlets * tag 'x86_misc_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/nmi: Fix the inverse "in NMI handler" check Documentation/maintainer-tip: Add C++ tail comments exception Documentation/maintainer-tip: Add Closes tag x86/nmi: Rate limit unknown NMI messages Documentation/kernel-parameters: Add spec_rstack_overflow to mitigations=off
2024-03-11Merge tag 'x86_sev_for_v6.9_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - Add the x86 part of the SEV-SNP host support. This will allow the kernel to be used as a KVM hypervisor capable of running SNP (Secure Nested Paging) guests. Roughly speaking, SEV-SNP is the ultimate goal of the AMD confidential computing side, providing the most comprehensive confidential computing environment up to date. This is the x86 part and there is a KVM part which did not get ready in time for the merge window so latter will be forthcoming in the next cycle. - Rework the early code's position-dependent SEV variable references in order to allow building the kernel with clang and -fPIE/-fPIC and -mcmodel=kernel - The usual set of fixes, cleanups and improvements all over the place * tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) x86/sev: Disable KMSAN for memory encryption TUs x86/sev: Dump SEV_STATUS crypto: ccp - Have it depend on AMD_IOMMU iommu/amd: Fix failure return from snp_lookup_rmpentry() x86/sev: Fix position dependent variable references in startup code crypto: ccp: Make snp_range_list static x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT Documentation: virt: Fix up pre-formatted text block for SEV ioctls crypto: ccp: Add the SNP_SET_CONFIG command crypto: ccp: Add the SNP_COMMIT command crypto: ccp: Add the SNP_PLATFORM_STATUS command x86/cpufeatures: Enable/unmask SEV-SNP CPU feature KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump iommu/amd: Clean up RMP entries for IOMMU pages during SNP shutdown crypto: ccp: Handle legacy SEV commands when SNP is enabled crypto: ccp: Handle non-volatile INIT_EX data when SNP is enabled crypto: ccp: Handle the legacy TMR allocation when SNP is enabled x86/sev: Introduce an SNP leaked pages list crypto: ccp: Provide an API to issue SEV and SNP commands ...
2024-03-11Merge tag 'x86_cache_for_v6.9_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull resource control updates from Borislav Petkov: - Rework different aspects of the resctrl code like adding arch-specific accessors and splitting the locking, in order to accomodate ARM's MPAM implementation of hw resource control and be able to use the same filesystem control interface like on x86. Work by James Morse - Improve the memory bandwidth throttling heuristic to handle workloads with not too regular load levels which end up penalized unnecessarily - Use CPUID to detect the memory bandwidth enforcement limit on AMD - The usual set of fixes * tag 'x86_cache_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits) x86/resctrl: Remove lockdep annotation that triggers false positive x86/resctrl: Separate arch and fs resctrl locks x86/resctrl: Move domain helper migration into resctrl_offline_cpu() x86/resctrl: Add CPU offline callback for resctrl work x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but CPU x86/resctrl: Add CPU online callback for resctrl work x86/resctrl: Add helpers for system wide mon/alloc capable x86/resctrl: Make rdt_enable_key the arch's decision to switch x86/resctrl: Move alloc/mon static keys into helpers x86/resctrl: Make resctrl_mounted checks explicit x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read() x86/resctrl: Allow resctrl_arch_rmid_read() to sleep x86/resctrl: Queue mon_event_read() instead of sending an IPI x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow x86/resctrl: Move CLOSID/RMID matching and setting to use helpers x86/resctrl: Allocate the cleanest CLOSID by searching closid_num_dirty_rmid x86/resctrl: Use __set_bit()/__clear_bit() instead of open coding x86/resctrl: Track the number of dirty RMID a CLOSID has x86/resctrl: Allow RMID allocation to be scoped by CLOSID x86/resctrl: Access per-rmid structures by index ...
2024-03-11Merge tag 'x86_mtrr_for_v6.9_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MTRR update from Borislav Petkov: - Relax the PAT MSR programming which was unnecessarily using the MTRR programming protocol of disabling the cache around the changes. The reason behind this is the current algorithm triggering a #VE exception for TDX guests and unnecessarily complicating things * tag 'x86_mtrr_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/pat: Simplify the PAT programming protocol
2024-03-11Merge tag 'x86_cpu_for_v6.9_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu update from Borislav Petkov: - Have AMD Zen common init code run on all families from Zen1 onwards in order to save some future enablement effort * tag 'x86_cpu_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/CPU/AMD: Do the common init on future Zens too
2024-03-11Merge tag 'ras_core_for_v6.9_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS fixlet from Borislav Petkov: - Constify yet another static struct bus_type instance now that the driver core can handle that * tag 'ras_core_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Make mce_subsys const
2024-03-11Merge tag 'x86-fred-2024-03-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 FRED support from Thomas Gleixner: "Support for x86 Fast Return and Event Delivery (FRED). FRED is a replacement for IDT event delivery on x86 and addresses most of the technical nightmares which IDT exposes: 1) Exception cause registers like CR2 need to be manually preserved in nested exception scenarios. 2) Hardware interrupt stack switching is suboptimal for nested exceptions as the interrupt stack mechanism rewinds the stack on each entry which requires a massive effort in the low level entry of #NMI code to handle this. 3) No hardware distinction between entry from kernel or from user which makes establishing kernel context more complex than it needs to be especially for unconditionally nestable exceptions like NMI. 4) NMI nesting caused by IRET unconditionally reenabling NMIs, which is a problem when the perf NMI takes a fault when collecting a stack trace. 5) Partial restore of ESP when returning to a 16-bit segment 6) Limitation of the vector space which can cause vector exhaustion on large systems. 7) Inability to differentiate NMI sources FRED addresses these shortcomings by: 1) An extended exception stack frame which the CPU uses to save exception cause registers. This ensures that the meta information for each exception is preserved on stack and avoids the extra complexity of preserving it in software. 2) Hardware interrupt stack switching is non-rewinding if a nested exception uses the currently interrupt stack. 3) The entry points for kernel and user context are separate and GS BASE handling which is required to establish kernel context for per CPU variable access is done in hardware. 4) NMIs are now nesting protected. They are only reenabled on the return from NMI. 5) FRED guarantees full restore of ESP 6) FRED does not put a limitation on the vector space by design because it uses a central entry points for kernel and user space and the CPUstores the entry type (exception, trap, interrupt, syscall) on the entry stack along with the vector number. The entry code has to demultiplex this information, but this removes the vector space restriction. The first hardware implementations will still have the current restricted vector space because lifting this limitation requires further changes to the local APIC. 7) FRED stores the vector number and meta information on stack which allows having more than one NMI vector in future hardware when the required local APIC changes are in place. The series implements the initial FRED support by: - Reworking the existing entry and IDT handling infrastructure to accomodate for the alternative entry mechanism. - Expanding the stack frame to accomodate for the extra 16 bytes FRED requires to store context and meta information - Providing FRED specific C entry points for events which have information pushed to the extended stack frame, e.g. #PF and #DB. - Providing FRED specific C entry points for #NMI and #MCE - Implementing the FRED specific ASM entry points and the C code to demultiplex the events - Providing detection and initialization mechanisms and the necessary tweaks in context switching, GS BASE handling etc. The FRED integration aims for maximum code reuse vs the existing IDT implementation to the extent possible and the deviation in hot paths like context switching are handled with alternatives to minimalize the impact. The low level entry and exit paths are seperate due to the extended stack frame and the hardware based GS BASE swichting and therefore have no impact on IDT based systems. It has been extensively tested on existing systems and on the FRED simulation and as of now there are no outstanding problems" * tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/fred: Fix init_task thread stack pointer initialization MAINTAINERS: Add a maintainer entry for FRED x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly x86/fred: Invoke FRED initialization code to enable FRED x86/fred: Add FRED initialization functions x86/syscall: Split IDT syscall setup code into idt_syscall_init() KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled x86/traps: Add sysvec_install() to install a system interrupt handler x86/fred: FRED entry/exit and dispatch code x86/fred: Add a machine check entry stub for FRED x86/fred: Add a NMI entry stub for FRED x86/fred: Add a debug fault entry stub for FRED x86/idtentry: Incorporate definitions/declarations of the FRED entries x86/fred: Make exc_page_fault() work for FRED x86/fred: Allow single-step trap and NMI when starting a new task x86/fred: No ESPFIX needed when FRED is enabled ...
2024-03-11Merge tag 'x86-apic-2024-03-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 APIC updates from Thomas Gleixner: "Rework of APIC enumeration and topology evaluation. The current implementation has a couple of shortcomings: - It fails to handle hybrid systems correctly. - The APIC registration code which handles CPU number assignents is in the middle of the APIC code and detached from the topology evaluation. - The various mechanisms which enumerate APICs, ACPI, MPPARSE and guest specific ones, tweak global variables as they see fit or in case of XENPV just hack around the generic mechanisms completely. - The CPUID topology evaluation code is sprinkled all over the vendor code and reevaluates global variables on every hotplug operation. - There is no way to analyze topology on the boot CPU before bringing up the APs. This causes problems for infrastructure like PERF which needs to size certain aspects upfront or could be simplified if that would be possible. - The APIC admission and CPU number association logic is incomprehensible and overly complex and needs to be kept around after boot instead of completing this right after the APIC enumeration. This update addresses these shortcomings with the following changes: - Rework the CPUID evaluation code so it is common for all vendors and provides information about the APIC ID segments in a uniform way independent of the number of segments (Thread, Core, Module, ..., Die, Package) so that this information can be computed instead of rewriting global variables of dubious value over and over. - A few cleanups and simplifcations of the APIC, IO/APIC and related interfaces to prepare for the topology evaluation changes. - Seperation of the parser stages so the early evaluation which tries to find the APIC address can be seperately overridden from the late evaluation which enumerates and registers the local APIC as further preparation for sanitizing the topology evaluation. - A new registration and admission logic which - encapsulates the inner workings so that parsers and guest logic cannot longer fiddle in it - uses the APIC ID segments to build topology bitmaps at registration time - provides a sane admission logic - allows to detect the crash kernel case, where CPU0 does not run on the real BSP, automatically. This is required to prevent sending INIT/SIPI sequences to the real BSP which would reset the whole machine. This was so far handled by a tedious command line parameter, which does not even work in nested crash scenarios. - Associates CPU number after the enumeration completed and prevents the late registration of APICs, which was somehow tolerated before. - Converting all parsers and guest enumeration mechanisms over to the new interfaces. This allows to get rid of all global variable tweaking from the parsers and enumeration mechanisms and sanitizes the XEN[PV] handling so it can use CPUID evaluation for the first time. - Mopping up existing sins by taking the information from the APIC ID segment bitmaps. This evaluates hybrid systems correctly on the boot CPU and allows for cleanups and fixes in the related drivers, e.g. PERF. The series has been extensively tested and the minimal late fallout due to a broken ACPI/MADT table has been addressed by tightening the admission logic further" * tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits) x86/topology: Ignore non-present APIC IDs in a present package x86/apic: Build the x86 topology enumeration functions on UP APIC builds too smp: Provide 'setup_max_cpus' definition on UP too smp: Avoid 'setup_max_cpus' namespace collision/shadowing x86/bugs: Use fixed addressing for VERW operand x86/cpu/topology: Get rid of cpuinfo::x86_max_cores x86/cpu/topology: Provide __num_[cores|threads]_per_package x86/cpu/topology: Rename topology_max_die_per_package() x86/cpu/topology: Rename smp_num_siblings x86/cpu/topology: Retrieve cores per package from topology bitmaps x86/cpu/topology: Use topology logical mapping mechanism x86/cpu/topology: Provide logical pkg/die mapping x86/cpu/topology: Simplify cpu_mark_primary_thread() x86/cpu/topology: Mop up primary thread mask handling x86/cpu/topology: Use topology bitmaps for sizing x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT x86/xen/smp_pv: Count number of vCPUs early x86/cpu/topology: Assign hotpluggable CPUIDs during init x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug x86/topology: Add a mechanism to track topology via APIC IDs ...
2024-03-11Merge tag 'timers-ptp-2024-03-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull clocksource updates from Thomas Gleixner: "Updates for timekeeping and PTP core. The cross-timestamp mechanism which allows to correlate hardware clocks uses clocksource pointers for describing the correlation. That's suboptimal as drivers need to obtain the pointer, which requires needless exports and exposing internals. This can all be completely avoided by assigning clocksource IDs and using them for describing the correlated clock source. So this adds clocksource IDs to all clocksources in the tree which can be exposed to this mechanism and removes the pointer and now needless exports. A related improvement for the core and the correlation handling has not made it this time, but is expected to get ready for the next round" * tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kvmclock: Unexport kvmclock clocksource treewide: Remove system_counterval_t.cs, which is never read timekeeping: Evaluate system_counterval_t.cs_id instead of .cs ptp/kvm, arm_arch_timer: Set system_counterval_t.cs_id to constant x86/kvm, ptp/kvm: Add clocksource ID, set system_counterval_t.cs_id x86/tsc: Add clocksource ID, set system_counterval_t.cs_id timekeeping: Add clocksource ID to struct system_counterval_t x86/tsc: Correct kernel-doc notation
2024-03-11Merge tag 'irq-msi-2024-03-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull MSI updates from Thomas Gleixner: "Updates for the MSI interrupt subsystem and initial RISC-V MSI support. The core changes have been adopted from previous work which converted ARM[64] to the new per device MSI domain model, which was merged to support multiple MSI domain per device. The ARM[64] changes are being worked on too, but have not been ready yet. The core and platform-MSI changes have been split out to not hold up RISC-V and to avoid that RISC-V builds on the scheduled for removal interfaces. The core support provides new interfaces to handle wire to MSI bridges in a straight forward way and introduces new platform-MSI interfaces which are built on top of the per device MSI domain model. Once ARM[64] is converted over the old platform-MSI interfaces and the related ugliness in the MSI core code will be removed. The actual MSI parts for RISC-V were finalized late and have been post-poned for the next merge window. Drivers: - Add a new driver for the Andes hart-level interrupt controller - Rework the SiFive PLIC driver to prepare for MSI suport - Expand the RISC-V INTC driver to support the new RISC-V AIA controller which provides the basis for MSI on RISC-V - A few fixup for the fallout of the core changes" * tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits) irqchip/riscv-intc: Fix low-level interrupt handler setup for AIA x86/apic/msi: Use DOMAIN_BUS_GENERIC_MSI for HPET/IO-APIC domain search genirq/matrix: Dynamic bitmap allocation irqchip/riscv-intc: Add support for RISC-V AIA irqchip/sifive-plic: Improve locking safety by using irqsave/irqrestore irqchip/sifive-plic: Parse number of interrupts and contexts early in plic_probe() irqchip/sifive-plic: Cleanup PLIC contexts upon irqdomain creation failure irqchip/sifive-plic: Use riscv_get_intc_hwnode() to get parent fwnode irqchip/sifive-plic: Use devm_xyz() for managed allocation irqchip/sifive-plic: Use dev_xyz() in-place of pr_xyz() irqchip/sifive-plic: Convert PLIC driver into a platform driver irqchip/riscv-intc: Introduce Andes hart-level interrupt controller irqchip/riscv-intc: Allow large non-standard interrupt number genirq/irqdomain: Don't call ops->select for DOMAIN_BUS_ANY tokens irqchip/imx-intmux: Handle pure domain searches correctly genirq/msi: Provide MSI_FLAG_PARENT_PM_DEV genirq/irqdomain: Reroute device MSI create_mapping genirq/msi: Provide allocation/free functions for "wired" MSI interrupts genirq/msi: Optionally use dev->fwnode for device domain genirq/msi: Provide DOMAIN_BUS_WIRED_TO_MSI ...
2024-03-11x86/rfds: Mitigate Register File Data Sampling (RFDS)Pawan Gupta
RFDS is a CPU vulnerability that may allow userspace to infer kernel stale data previously used in floating point registers, vector registers and integer registers. RFDS only affects certain Intel Atom processors. Intel released a microcode update that uses VERW instruction to clear the affected CPU buffers. Unlike MDS, none of the affected cores support SMT. Add RFDS bug infrastructure and enable the VERW based mitigation by default, that clears the affected buffers just before exiting to userspace. Also add sysfs reporting and cmdline parameter "reg_file_data_sampling" to control the mitigation. For details see: Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-03-11x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is setPawan Gupta
Currently MMIO Stale Data mitigation for CPUs not affected by MDS/TAA is to only deploy VERW at VMentry by enabling mmio_stale_data_clear static branch. No mitigation is needed for kernel->user transitions. If such CPUs are also affected by RFDS, its mitigation may set X86_FEATURE_CLEAR_CPU_BUF to deploy VERW at kernel->user and VMentry. This could result in duplicate VERW at VMentry. Fix this by disabling mmio_stale_data_clear static branch when X86_FEATURE_CLEAR_CPU_BUF is enabled. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
2024-03-11Merge tag 'kvm-x86-vmx-6.9' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM VMX changes for 6.9: - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace due to an unexpected VM-Exit while the CPU was vectoring an exception. - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support. - Clean up the logic for massaging the passthrough MSR bitmaps when userspace changes its MSR filter.
2024-03-11Merge tag 'loongarch-kvm-6.9' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.9 * Set reserved bits as zero in CPUCFG. * Start SW timer only when vcpu is blocking. * Do not restart SW timer when it is expired. * Remove unnecessary CSR register saving during enter guest.
2024-03-09Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of ↵Paolo Bonzini
https://github.com/kvm-x86/linux into HEAD KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating ABI that KVM can't sanely support. - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely a development and testing vehicle, and come with zero guarantees. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
2024-03-08x86/of: Unconditionally call unflatten_and_copy_device_tree()Stephen Boyd
Call this function unconditionally so that we can populate an empty DTB on platforms that don't boot with a firmware provided or builtin DTB. There's no harm in calling unflatten_device_tree() unconditionally here. If there isn't a non-NULL 'initial_boot_params' pointer then unflatten_device_tree() returns early. Cc: Rob Herring <robh+dt@kernel.org> Cc: Frank Rowand <frowand.list@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: H. Peter Anvin <hpa@zytor.com> Tested-by: Saurabh Sengar <ssengar@linux.microsoft.com> Signed-off-by: Stephen Boyd <sboyd@kernel.org> Link: https://lore.kernel.org/r/20240217010557.2381548-5-sboyd@kernel.org Signed-off-by: Rob Herring <robh@kernel.org>
2024-03-08x86/sev: Disable KMSAN for memory encryption TUsChangbin Du
Instrumenting sev.c and mem_encrypt_identity.c with KMSAN will result in a triple-faulting kernel. Some of the code is invoked too early during boot, before KMSAN is ready. Disable KMSAN instrumentation for the two translation units. [ bp: Massage commit message. ] Signed-off-by: Changbin Du <changbin.du@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240308044401.1120395-1-changbin.du@huawei.com
2024-03-07x86/fred: Fix init_task thread stack pointer initializationXin Li (Intel)
As TOP_OF_KERNEL_STACK_PADDING was defined as 0 on x86_64, it went unnoticed that the initialization of the .sp field in INIT_THREAD and some calculations in the low level startup code do not take the padding into account. FRED enabled kernels require a 16 byte padding, which means that the init task initialization and the low level startup code use the wrong stack offset. Subtract TOP_OF_KERNEL_STACK_PADDING in all affected places to adjust for this. Fixes: 65c9cc9e2c14 ("x86/fred: Reserve space for the FRED stack frame") Fixes: 3adee777ad0d ("x86/smpboot: Remove initial_stack on 64-bit") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Closes: https://lore.kernel.org/oe-lkp/202402262159.183c2a37-lkp@intel.com Link: https://lore.kernel.org/r/20240304083333.449322-1-xin@zytor.com
2024-03-07x86/kprobes: Boost more instructions from grp2/3/4/5Jinghao Jia
With the instruction decoder, we are now able to decode and recognize instructions with opcode extensions. There are more instructions in these groups that can be boosted: Group 2: ROL, ROR, RCL, RCR, SHL/SAL, SHR, SAR Group 3: TEST, NOT, NEG, MUL, IMUL, DIV, IDIV Group 4: INC, DEC (byte operation) Group 5: INC, DEC (word/doubleword/quadword operation) These instructions are not boosted previously because there are reserved opcodes within the groups, e.g., group 2 with ModR/M.nnn == 110 is unmapped. As a result, kprobes attached to them requires two int3 traps as being non-boostable also prevents jump-optimization. Some simple tests on QEMU show that after boosting and jump-optimization a single kprobe on these instructions with an empty pre-handler runs 10x faster (~1000 cycles vs. ~100 cycles). Since these instructions are mostly ALU operations and do not touch special registers like RIP, let's boost them so that we get the performance benefit. Link: https://lore.kernel.org/all/20240204031300.830475-4-jinghao7@illinois.edu/ Signed-off-by: Jinghao Jia <jinghao7@illinois.edu> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-03-07x86/kprobes: Prohibit kprobing on INT and UDJinghao Jia
Both INT (INT n, INT1, INT3, INTO) and UD (UD0, UD1, UD2) serve special purposes in the kernel, e.g., INT3 is used by KGDB and UD2 is involved in LLVM-KCFI instrumentation. At the same time, attaching kprobes on these instructions (particularly UD) will pollute the stack trace dumped in the kernel ring buffer, since the exception is triggered in the copy buffer rather than the original location. Check for INT and UD in can_probe and reject any kprobes trying to attach to these instructions. Link: https://lore.kernel.org/all/20240204031300.830475-3-jinghao7@illinois.edu/ Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Jinghao Jia <jinghao7@illinois.edu> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-03-07x86/kprobes: Refactor can_{probe,boost} return type to boolJinghao Jia
Both can_probe and can_boost have int return type but are using int as boolean in their context. Refactor both functions to make them actually return boolean. Link: https://lore.kernel.org/all/20240204031300.830475-2-jinghao7@illinois.edu/ Signed-off-by: Jinghao Jia <jinghao7@illinois.edu> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-03-06x86/topology: Ignore non-present APIC IDs in a present packageThomas Gleixner
Borislav reported that one of his systems has a broken MADT table which advertises eight present APICs and 24 non-present APICs in the same package. The non-present ones are considered hot-pluggable by the topology evaluation code, which is obviously bogus as there is no way to hot-plug within the same package. As the topology evaluation code accounts for hot-pluggable CPUs in a package, the maximum number of cores per package is computed wrong, which in turn causes the uncore performance counter driver to access non-existing MSRs. It will probably confuse other entities which rely on the maximum number of cores and threads per package too. Cure this by ignoring hot-pluggable APIC IDs within a present package. In theory it would be reasonable to just do this unconditionally, but then there is this thing called reality^Wvirtualization which ruins everything. Virtualization is the only existing user of "physical" hotplug and the virtualization tools allow the above scenario. Whether that is actually in use or not is unknown. As it can be argued that the virtualization case is not affected by the issues which exposed the reported problem, allow the bogosity if the kernel determined that it is running in a VM for now. Fixes: 89b0f15f408f ("x86/cpu/topology: Get rid of cpuinfo::x86_max_cores") Reported-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/87a5nbvccx.ffs@tglx
2024-03-05cpuidle: ACPI/intel: fix MWAIT hint target C-state computationHe Rongguang
According to x86 spec ([1] and [2]), MWAIT hint_address[7:4] plus 1 is the corresponding C-state, and 0xF means C0. ACPI C-state table usually only contains C1+, but nothing prevents ACPI firmware from presenting a C-state (maybe C1+) but using MWAIT address C0 (i.e., 0xF in ACPI FFH MWAIT hint address). And if this is the case, Linux erroneously treat this cstate as C16, while actually this should be valid C0 instead of C16, as per the specifications. Since ACPI firmware is out of Linux kernel scope, fix the kernel handling of 0xF ->(to) C0 in this situation. This is found when a tweaked ACPI C-state table is presented by Qemu to VM. Also modify the intel_idle case for code consistency. [1]. Intel SDM Vol 2, Table 4-11. MWAIT Hints Register (EAX): "Value of 0 means C1; 1 means C2 and so on Value of 01111B means C0". [2]. AMD manual Vol 3, MWAIT: "The processor C-state is EAX[7:4]+1, so to request C0 is to place the value F in EAX[7:4] and to request C1 is to place the value 0 in EAX[7:4].". Signed-off-by: He Rongguang <herongguang@linux.alibaba.com> [ rjw: Subject and changelog edits, whitespace fixups ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-03-05ACPI: CPPC: enable AMD CPPC V2 support for family 17h processorsPerry Yuan
As there are some AMD processors which only support CPPC V2 firmware and BIOS implementation, the amd_pstate driver will be failed to load when system booting with below kernel warning message: [ 0.477523] amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled To make the amd_pstate driver can be loaded on those TR40 processors, it needs to match x86_model from 0x30 to 0x7F for family 17H. With the change, the system can load amd_pstate driver as expected. Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Reported-by: Gino Badouri <badouri.g@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218171 Fixes: fbd74d1689 ("ACPI: CPPC: Fix enabling CPPC on AMD systems with shared memory") Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-03-04x86/sev: Move early startup code into .head.text sectionArd Biesheuvel
In preparation for implementing rigorous build time checks to enforce that only code that can support it will be called from the early 1:1 mapping of memory, move SEV init code that is called in this manner to the .head.text section. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240227151907.387873-19-ardb+git@google.com
2024-03-04x86/startup_64: Simplify virtual switch on primary bootArd Biesheuvel
The secondary startup code is used on the primary boot path as well, but in this case, the initial part runs from a 1:1 mapping, until an explicit cross-jump is made to the kernel virtual mapping of the same code. On the secondary boot path, this jump is pointless as the code already executes from the mapping targeted by the jump. So combine this cross-jump with the jump from startup_64() into the common boot path. This simplifies the execution flow, and clearly separates code that runs from a 1:1 mapping from code that runs from the kernel virtual mapping. Note that this requires a page table switch, so hoist the CR3 assignment into startup_64() as well. And since absolute symbol references will no longer be permitted in .head.text once we enable the associated build time checks, a RIP-relative memory operand is used in the JMP instruction, referring to an absolute constant in the .init.rodata section. Given that the secondary startup code does not require a special placement inside the executable, move it to the .text section. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240227151907.387873-15-ardb+git@google.com
2024-03-04x86/startup_64: Simplify calculation of initial page table addressArd Biesheuvel
Determining the address of the initial page table to program into CR3 involves: - taking the physical address - adding the SME encryption mask On the primary entry path, the code is mapped using a 1:1 virtual to physical translation, so the physical address can be taken directly using a RIP-relative LEA instruction. On the secondary entry path, the address can be obtained by taking the offset from the virtual kernel base (__START_kernel_map) and adding the physical kernel base. This is implemented in a slightly confusing way, so clean this up. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240227151907.387873-14-ardb+git@google.com
2024-03-04x86/startup_64: Defer assignment of 5-level paging global variablesArd Biesheuvel
Assigning the 5-level paging related global variables from the earliest C code using explicit references that use the 1:1 translation of memory is unnecessary, as the startup code itself does not rely on them to create the initial page tables, and this is all it should be doing. So defer these assignments to the primary C entry code that executes via the ordinary kernel virtual mapping. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240227151907.387873-13-ardb+git@google.com
2024-03-04x86/startup_64: Simplify CR4 handling in startup codeArd Biesheuvel
When paging is enabled, the CR4.PAE and CR4.LA57 control bits cannot be changed, and so they can simply be preserved rather than reason about whether or not they need to be set. CR4.MCE should be preserved unless the kernel was built without CONFIG_X86_MCE, in which case it must be cleared. CR4.PSE should be set explicitly, regardless of whether or not it was set before. CR4.PGE is set explicitly, and then cleared and set again after programming CR3 in order to flush TLB entries based on global translations. This makes the first assignment redundant, and can therefore be omitted. So clear PGE by omitting it from the preserve mask, and set it again explicitly after switching to the new page tables. [ bp: Document the exact operation of CR4.PGE ] Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240227151907.387873-12-ardb+git@google.com
2024-03-04x86/idle: Select idle routine only onceThomas Gleixner
The idle routine selection is done on every CPU bringup operation and has a guard in place which is effective after the first invocation, which is a pointless exercise. Invoke it once on the boot CPU and mark the related functions __init. The guard check has to stay as xen_set_default_idle() runs early. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/87edcu6vaq.ffs@tglx
2024-03-04x86/idle: Let prefer_mwait_c1_over_halt() return boolThomas Gleixner
The return value is truly boolean. Make it so. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240229142248.518723854@linutronix.de
2024-03-04x86/idle: Cleanup idle_setup()Thomas Gleixner
Updating the static call for x86_idle() from idle_setup() is counter-intuitive. Let select_idle_routine() handle it like the other idle choices, which allows to simplify the idle selection later on. While at it rewrite comments and return a proper error code and not -1. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240229142248.455616019@linutronix.de
2024-03-04x86/idle: Clean up idle selectionThomas Gleixner
Clean up the code to make it readable. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240229142248.392017685@linutronix.de
2024-03-04x86/idle: Sanitize X86_BUG_AMD_E400 handlingThomas Gleixner
amd_e400_idle(), the idle routine for AMD CPUs which are affected by erratum 400 violates the RCU constraints by invoking tick_broadcast_enter() and tick_broadcast_exit() after the core code has marked RCU non-idle. The functions can end up in lockdep or tracing, which rightfully triggers a RCU warning. The core code provides now a static branch conditional invocation of the broadcast functions. Remove amd_e400_idle(), enforce default_idle() and enable the static branch on affected CPUs to cure this. [ bp: Fold in a fix for a IS_ENABLED() check fail missing a "CONFIG_" prefix which tglx spotted. ] Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/877cim6sis.ffs@tglx
2024-03-04x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variablesThomas Gleixner
Sparse complains rightfully about the usage of EXPORT_SYMBOL_GPL() for per CPU variables: callthunks.c:346:20: sparse: warning: incorrect type in initializer (different address spaces) callthunks.c:346:20: sparse: expected void const [noderef] __percpu *__vpp_verify callthunks.c:346:20: sparse: got unsigned long long * Use EXPORT_PER_CPU_SYMBOL_GPL() instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240304005104.841915535@linutronix.de
2024-03-04x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_currentThomas Gleixner
Sparse rightfully complains: bugs.c:71:9: sparse: warning: incorrect type in initializer (different address spaces) bugs.c:71:9: sparse: expected void const [noderef] __percpu *__vpp_verify bugs.c:71:9: sparse: got unsigned long long * The reason is that x86_spec_ctrl_current which is a per CPU variable is exported with EXPORT_SYMBOL_GPL(). Use EXPORT_PER_CPU_SYMBOL_GPL() instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240304005104.732288812@linutronix.de
2024-03-04x86/percpu: Cure per CPU madness on UPThomas Gleixner
On UP builds Sparse complains rightfully about accesses to cpu_info with per CPU accessors: cacheinfo.c:282:30: sparse: warning: incorrect type in initializer (different address spaces) cacheinfo.c:282:30: sparse: expected void const [noderef] __percpu *__vpp_verify cacheinfo.c:282:30: sparse: got unsigned int * The reason is that on UP builds cpu_info which is a per CPU variable on SMP is mapped to boot_cpu_info which is a regular variable. There is a hideous accessor cpu_data() which tries to hide this, but it's not sufficient as some places require raw accessors and generates worse code than the regular per CPU accessors. Waste sizeof(struct x86_cpuinfo) memory on UP and provide the per CPU cpu_info unconditionally. This requires to update the CPU info on the boot CPU as SMP does. (Ab)use the weakly defined smp_prepare_boot_cpu() function and implement exactly that. This allows to use regular per CPU accessors uncoditionally and paves the way to remove the cpu_data() hackery. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240304005104.622511517@linutronix.de
2024-03-04smp: Consolidate smp_prepare_boot_cpu()Thomas Gleixner
There is no point in having seven architectures implementing the same empty stub. Provide a weak function in the init code and remove the stubs. This also allows to utilize the function on UP which is required to sanitize the per CPU handling on X86 UP. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240304005104.567671691@linutronix.de
2024-03-04x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h>Thomas Gleixner
To clean up the per CPU insanity of UP which causes sparse to be rightfully unhappy and prevents the usage of the generic per CPU accessors on cpu_info it is necessary to include <linux/percpu.h> into <asm/msr.h>. Including <linux/percpu.h> into <asm/msr.h> is impossible because it ends up in header dependency hell. The problem is that <asm/processor.h> includes <asm/msr.h>. The inclusion of <linux/percpu.h> results in a compile fail where the compiler cannot longer handle an include in <asm/cpufeature.h> which references boot_cpu_data which is defined in <asm/processor.h>. The only reason why <asm/msr.h> is included in <asm/processor.h> are the set/get_debugctlmsr() inlines. They are defined there because <asm/processor.h> is such a nice dump ground for everything. In fact they belong obviously into <asm/debugreg.h>. Move them to <asm/debugreg.h> and fix up the resulting damage which is just exposing the reliance on random include chains. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240304005104.454678686@linutronix.de
2024-03-04Merge tag 'v6.8-rc7' into x86/cleanups, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>