summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2025-01-21Merge tag 'kthread-for-6.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull kthread updates from Frederic Weisbecker: "Kthreads affinity follow either of 4 existing different patterns: 1) Per-CPU kthreads must stay affine to a single CPU and never execute relevant code on any other CPU. This is currently handled by smpboot code which takes care of CPU-hotplug operations. Affinity here is a correctness constraint. 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't run anywhere else. The affinity is set through kthread_bind_mask() and the subsystem takes care by itself to handle CPU-hotplug operations. Affinity here is assumed to be a correctness constraint. 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This is not a correctness constraint but merely a preference in terms of memory locality. kswapd and kcompactd both fall into this category. The affinity is set manually like for any other task and CPU-hotplug is supposed to be handled by the relevant subsystem so that the task is properly reaffined whenever a given CPU from the node comes up. Also care should be taken so that the node affinity doesn't cross isolated (nohz_full) cpumask boundaries. 4) Similar to the previous point except kthreads have a _preferred_ affinity different than a node. Both RCU boost kthreads and RCU exp kworkers fall into this category as they refer to "RCU nodes" from a distinctly distributed tree. Currently the preferred affinity patterns (3 and 4) have at least 4 identified users, with more or less success when it comes to handle CPU-hotplug operations and CPU isolation. Each of which do it in its own ad-hoc way. This is an infrastructure proposal to handle this with the following API changes: - kthread_create_on_node() automatically affines the created kthread to its target node unless it has been set as per-cpu or bound with kthread_bind[_mask]() before the first wake-up. - kthread_affine_preferred() is a new function that can be called right after kthread_create_on_node() to specify a preferred affinity different than the specified node. When the preferred affinity can't be applied because the possible targets are offline or isolated (nohz_full), the kthread is affine to the housekeeping CPUs (which means to all online CPUs most of the time or only the non-nohz_full CPUs when nohz_full= is set). kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been converted, along with a few old drivers. Summary of the changes: - Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu() - Introduce task_cpu_fallback_mask() that defines the default last resort affinity of a task to become nohz_full aware - Add some correctness check to ensure kthread_bind() is always called before the first kthread wake up. - Default affine kthread to its preferred node. - Convert kswapd / kcompactd and remove their halfway working ad-hoc affinity implementation - Implement kthreads preferred affinity - Unify kthread worker and kthread API's style - Convert RCU kthreads to the new API and remove the ad-hoc affinity implementation" * tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: kthread: modify kernel-doc function name to match code rcu: Use kthread preferred affinity for RCU exp kworkers treewide: Introduce kthread_run_worker[_on_cpu]() kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format rcu: Use kthread preferred affinity for RCU boost kthread: Implement preferred affinity mm: Create/affine kswapd to its preferred node mm: Create/affine kcompactd to its preferred node kthread: Default affine kthread to its preferred NUMA node kthread: Make sure kthread hasn't started while binding it sched,arm64: Handle CPU isolation on last resort fallback rq selection arm64: Exclude nohz_full CPUs from 32bits el0 support lib: test_objpool: Use kthread_run_on_cpu() kallsyms: Use kthread_run_on_cpu() soc/qman: test: Use kthread_run_on_cpu() arm/bL_switcher: Use kthread_run_on_cpu()
2025-01-21Merge tag 'ftrace-v6.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace updates from Steven Rostedt: - Have fprobes built on top of function graph infrastructure The fprobe logic is an optimized kprobe that uses ftrace to attach to functions when a probe is needed at the start or end of the function. The fprobe and kretprobe logic implements a similar method as the function graph tracer to trace the end of the function. That is to hijack the return address and jump to a trampoline to do the trace when the function exits. To do this, a shadow stack needs to be created to store the original return address. Fprobes and function graph do this slightly differently. Fprobes (and kretprobes) has slots per callsite that are reserved to save the return address. This is fine when just a few points are traced. But users of fprobes, such as BPF programs, are starting to add many more locations, and this method does not scale. The function graph tracer was created to trace all functions in the kernel. In order to do this, when function graph tracing is started, every task gets its own shadow stack to hold the return address that is going to be traced. The function graph tracer has been updated to allow multiple users to use its infrastructure. Now have fprobes be one of those users. This will also allow for the fprobe and kretprobe methods to trace the return address to become obsolete. With new technologies like CFI that need to know about these methods of hijacking the return address, going toward a solution that has only one method of doing this will make the kernel less complex. - Cleanup with guard() and free() helpers There were several places in the code that had a lot of "goto out" in the error paths to either unlock a lock or free some memory that was allocated. But this is error prone. Convert the code over to use the guard() and free() helpers that let the compiler unlock locks or free memory when the function exits. - Remove disabling of interrupts in the function graph tracer When function graph tracer was first introduced, it could race with interrupts and NMIs. To prevent that race, it would disable interrupts and not trace NMIs. But the code has changed to allow NMIs and also interrupts. This change was done a long time ago, but the disabling of interrupts was never removed. Remove the disabling of interrupts in the function graph tracer is it is not needed. This greatly improves its performance. - Allow the :mod: command to enable tracing module functions on the kernel command line. The function tracer already has a way to enable functions to be traced in modules by writing ":mod:<module>" into set_ftrace_filter. That will enable either all the functions for the module if it is loaded, or if it is not, it will cache that command, and when the module is loaded that matches <module>, its functions will be enabled. This also allows init functions to be traced. But currently events do not have that feature. Because enabling function tracing can be done very early at boot up (before scheduling is enabled), the commands that can be done when function tracing is started is limited. Having the ":mod:" command to trace module functions as they are loaded is very useful. Update the kernel command line function filtering to allow it. * tag 'ftrace-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (26 commits) ftrace: Implement :mod: cache filtering on kernel command line tracing: Adopt __free() and guard() for trace_fprobe.c bpf: Use ftrace_get_symaddr() for kprobe_multi probes ftrace: Add ftrace_get_symaddr to convert fentry_ip to symaddr Documentation: probes: Update fprobe on function-graph tracer selftests/ftrace: Add a test case for repeating register/unregister fprobe selftests: ftrace: Remove obsolate maxactive syntax check tracing/fprobe: Remove nr_maxactive from fprobe fprobe: Add fprobe_header encoding feature fprobe: Rewrite fprobe on function-graph tracer s390/tracing: Enable HAVE_FTRACE_GRAPH_FUNC ftrace: Add CONFIG_HAVE_FTRACE_GRAPH_FUNC bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled tracing/fprobe: Enable fprobe events with CONFIG_DYNAMIC_FTRACE_WITH_ARGS tracing: Add ftrace_fill_perf_regs() for perf event tracing: Add ftrace_partial_regs() for converting ftrace_regs to pt_regs fprobe: Use ftrace_regs in fprobe exit handler fprobe: Use ftrace_regs in fprobe entry handler fgraph: Pass ftrace_regs to retfunc fgraph: Replace fgraph_ret_regs with ftrace_regs ...
2025-01-21Merge tag 'irq-core-2025-01-21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull interrupt subsystem updates from Thomas Gleixner: - Consolidate the machine_kexec_mask_interrupts() by providing a generic implementation and replacing the copy & pasta orgy in the relevant architectures. - Prevent unconditional operations on interrupt chips during kexec shutdown, which can trigger warnings in certain cases when the underlying interrupt has been shut down before. - Make the enforcement of interrupt handling in interrupt context unconditionally available, so that it actually works for non x86 related interrupt chips. The earlier enablement for ARM GIC chips set the required chip flag, but did not notice that the check was hidden behind a config switch which is not selected by ARM[64]. - Decrapify the handling of deferred interrupt affinity setting. Some interrupt chips require that affinity changes are made from the context of handling an interrupt to avoid certain race conditions. For x86 this was the default, but with interrupt remapping this requirement was lifted and a flag was introduced which tells the core code that affinity changes can be done in any context. Unrestricted affinity changes are the default for the majority of interrupt chips. RISCV has the requirement to add the deferred mode to one of it's interrupt controllers, but with the original implementation this would require to add the any context flag to all other RISC-V interrupt chips. That's backwards, so reverse the logic and require that chips, which need the deferred mode have to be marked accordingly. That avoids chasing the 'sane' chips and marking them. - Add multi-node support to the Loongarch AVEC interrupt controller driver. - The usual tiny cleanups, fixes and improvements all over the place. * tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/generic_chip: Export irq_gc_mask_disable_and_ack_set() genirq/timings: Add kernel-doc for a function parameter genirq: Remove IRQ_MOVE_PCNTXT and related code x86/apic: Convert to IRQCHIP_MOVE_DEFERRED genirq: Provide IRQCHIP_MOVE_DEFERRED hexagon: Remove GENERIC_PENDING_IRQ leftover ARC: Remove GENERIC_PENDING_IRQ genirq: Remove handle_enforce_irqctx() wrapper genirq: Make handle_enforce_irqctx() unconditionally available irqchip/loongarch-avec: Add multi-nodes topology support irqchip/ts4800: Replace seq_printf() by seq_puts() irqchip/ti-sci-inta : Add module build support irqchip/ti-sci-intr: Add module build support irqchip/irq-brcmstb-l2: Replace brcmstb_l2_mask_and_ack() by generic function irqchip: keystone: Use syscon_regmap_lookup_by_phandle_args genirq/kexec: Prevent redundant IRQ masking by checking state before shutdown kexec: Consolidate machine_kexec_mask_interrupts() implementation genirq: Reuse irq_thread_fn() for forced thread case genirq: Move irq_thread_fn() further up in the code
2025-01-21Merge tag 'sched-core-2025-01-21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Fair scheduler (SCHED_FAIR) enhancements: - Behavioral improvements: - Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra) - Delayed-dequeue enhancements & fixes: (Vincent Guittot) - Rename h_nr_running into h_nr_queued - Add new cfs_rq.h_nr_runnable - Use the new cfs_rq.h_nr_runnable - Removed unsued cfs_rq.h_nr_delayed - Rename cfs_rq.idle_h_nr_running into h_nr_idle - Remove unused cfs_rq.idle_nr_running - Rename cfs_rq.nr_running into nr_queued - Do not try to migrate delayed dequeue task - Fix variable declaration position - Encapsulate set custom slice in a __setparam_fair() function - Fixes: - Fix race between yield_to() and try_to_wake_up() (Tianchen Ding) - Fix CPU bandwidth limit bypass during CPU hotplug (Vishal Chourasia) - Cleanups: - Clean up in migrate_degrades_locality() to improve readability (Peter Zijlstra) - Mark m*_vruntime() with __maybe_unused (Andy Shevchenko) - Update comments after sched_tick() rename (Sebastian Andrzej Siewior) - Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used() (Valentin Schneider) Deadline scheduler (SCHED_DL) enhancements: - Restore dl_server bandwidth on non-destructive root domain changes (Juri Lelli) - Correctly account for allocated bandwidth during hotplug (Juri Lelli) - Check bandwidth overflow earlier for hotplug (Juri Lelli) - Clean up goto label in pick_earliest_pushable_dl_task() (John Stultz) - Consolidate timer cancellation (Wander Lairson Costa) Load-balancer enhancements: - Improve performance by prioritizing migrating eligible tasks in sched_balance_rq() (Hao Jia) - Do not compute NUMA Balancing stats unnecessarily during load-balancing (K Prateek Nayak) - Do not compute overloaded status unnecessarily during load-balancing (K Prateek Nayak) Generic scheduling code enhancements: - Use READ_ONCE() in task_on_rq_queued(), to consistently use the WRITE_ONCE() updated ->on_rq field (Harshit Agarwal) Isolated CPUs support enhancements: (Waiman Long) - Make "isolcpus=nohz" equivalent to "nohz_full" - Consolidate housekeeping cpumasks that are always identical - Remove HK_TYPE_SCHED - Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE RSEQ enhancements: - Validate read-only fields under DEBUG_RSEQ config (Mathieu Desnoyers) PSI enhancements: - Fix race when task wakes up before psi_sched_switch() adjusts flags (Chengming Zhou) IRQ time accounting performance enhancements: (Yafang Shao) - Define sched_clock_irqtime as static key - Don't account irq time if sched_clock_irqtime is disabled Virtual machine scheduling enhancements: - Don't try to catch up excess steal time (Suleiman Souhlal) Heterogenous x86 CPU scheduling enhancements: (K Prateek Nayak) - Convert "sysctl_sched_itmt_enabled" to boolean - Use guard() for itmt_update_mutex - Move the "sched_itmt_enabled" sysctl to debugfs - Remove x86_smt_flags and use cpu_smt_flags directly - Use x86_sched_itmt_flags for PKG domain unconditionally Debugging code & instrumentation enhancements: - Change need_resched warnings to pr_err() (David Rientjes) - Print domain name in /proc/schedstat (K Prateek Nayak) - Fix value reported by hot tasks pulled in /proc/schedstat (Peter Zijlstra) - Report the different kinds of imbalances in /proc/schedstat (Swapnil Sapkal) - Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal) - Update Schedstat version to 17 (Swapnil Sapkal)" * tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits) rseq: Fix rseq unregistration regression psi: Fix race when task wakes up before psi_sched_switch() adjusts flags sched, psi: Don't account irq time if sched_clock_irqtime is disabled sched: Don't account irq time if sched_clock_irqtime is disabled sched: Define sched_clock_irqtime as static key sched/fair: Do not compute overloaded status unnecessarily during lb sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs x86/itmt: Use guard() for itmt_update_mutex x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean sched/core: Prioritize migrating eligible tasks in sched_balance_rq() sched/debug: Change need_resched warnings to pr_err sched/fair: Encapsulate set custom slice in a __setparam_fair() function sched: Fix race between yield_to() and try_to_wake_up() docs: Update Schedstat version to 17 sched/stats: Print domain name in /proc/schedstat sched: Move sched domain name out of CONFIG_SCHED_DEBUG sched: Report the different kinds of imbalances in /proc/schedstat ...
2025-01-21Merge tag 'x86-cleanups-2025-01-21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Miscellaneous x86 cleanups and typo fixes, and also the removal of the 'disablelapic' boot parameter" * tag 'x86-cleanups-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ioapic: Remove a stray tab in the IO-APIC type string x86/cpufeatures: Remove "AMD" from the comments to the AMD-specific leaf Documentation/kernel-parameters: Fix a typo in kvm.enable_virt_at_load text x86/cpu: Fix typo in x86_match_cpu()'s doc x86/apic: Remove "disablelapic" cmdline option Documentation: Merge x86-specific boot options doc into kernel-parameters.txt x86/ioremap: Remove unused size parameter in remapping functions x86/ioremap: Simplify setup_data mapping variants x86/boot/compressed: Remove unused header includes from kaslr.c
2025-01-21Merge tag 'perf-core-2025-01-20' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Seqlock optimizations that arose in a perf context and were merged into the perf tree: - seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan) - mm: Convert mm_lock_seq to a proper seqcount (Suren Baghdasaryan) - mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren Baghdasaryan) - mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra) Core perf enhancements: - Reduce 'struct page' footprint of perf by mapping pages in advance (Lorenzo Stoakes) - Save raw sample data conditionally based on sample type (Yabin Cui) - Reduce sampling overhead by checking sample_type in perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin Cui) - Export perf_exclude_event() (Namhyung Kim) Uprobes scalability enhancements: (Andrii Nakryiko) - Simplify find_active_uprobe_rcu() VMA checks - Add speculative lockless VMA-to-inode-to-uprobe resolution - Simplify session consumer tracking - Decouple return_instance list traversal and freeing - Ensure return_instance is detached from the list before freeing - Reuse return_instances between multiple uretprobes within task - Guard against kmemdup() failing in dup_return_instance() AMD core PMU driver enhancements: - Relax privilege filter restriction on AMD IBS (Namhyung Kim) AMD RAPL energy counters support: (Dhananjay Ugwekar) - Introduce topology_logical_core_id() (K Prateek Nayak) - Remove the unused get_rapl_pmu_cpumask() function - Remove the cpu_to_rapl_pmu() function - Rename rapl_pmu variables - Make rapl_model struct global - Add arguments to the init and cleanup functions - Modify the generic variable names to *_pkg* - Remove the global variable rapl_msrs - Move the cntr_mask to rapl_pmus struct - Add core energy counter support for AMD CPUs Intel core PMU driver enhancements: - Support RDPMC 'metrics clear mode' feature (Kan Liang) - Clarify adaptive PEBS processing (Kan Liang) - Factor out functions for PEBS records processing (Kan Liang) - Simplify the PEBS records processing for adaptive PEBS (Kan Liang) Intel uncore driver enhancements: (Kan Liang) - Convert buggy pmu->func_id use to pmu->registered - Support more units on Granite Rapids" * tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) perf: map pages in advance perf/x86/intel/uncore: Support more units on Granite Rapids perf/x86/intel/uncore: Clean up func_id perf/x86/intel: Support RDPMC metrics clear mode uprobes: Guard against kmemdup() failing in dup_return_instance() perf/x86: Relax privilege filter restriction on AMD IBS perf/core: Export perf_exclude_event() uprobes: Reuse return_instances between multiple uretprobes within task uprobes: Ensure return_instance is detached from the list before freeing uprobes: Decouple return_instance list traversal and freeing uprobes: Simplify session consumer tracking uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution uprobes: simplify find_active_uprobe_rcu() VMA checks mm: introduce mmap_lock_speculate_{try_begin|retry} mm: convert mm_lock_seq to a proper seqcount mm/gup: Use raw_seqcount_try_begin() seqlock: add raw_seqcount_try_begin perf/x86/rapl: Add core energy counter support for AMD CPUs perf/x86/rapl: Move the cntr_mask to rapl_pmus struct perf/x86/rapl: Remove the global variable rapl_msrs ...
2025-01-21Merge tag 'objtool-core-2025-01-20' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: - Introduce the generic section-based annotation infrastructure a.k.a. ASM_ANNOTATE/ANNOTATE (Peter Zijlstra) - Convert various facilities to ASM_ANNOTATE/ANNOTATE: (Peter Zijlstra) - ANNOTATE_NOENDBR - ANNOTATE_RETPOLINE_SAFE - instrumentation_{begin,end}() - VALIDATE_UNRET_BEGIN - ANNOTATE_IGNORE_ALTERNATIVE - ANNOTATE_INTRA_FUNCTION_CALL - {.UN}REACHABLE - Optimize the annotation-sections parsing code (Peter Zijlstra) - Centralize annotation definitions in <linux/objtool.h> - Unify & simplify the barrier_before_unreachable()/unreachable() definitions (Peter Zijlstra) - Convert unreachable() calls to BUG() in x86 code, as unreachable() has unreliable code generation (Peter Zijlstra) - Remove annotate_reachable() and annotate_unreachable(), as it's unreliable against compiler optimizations (Peter Zijlstra) - Fix non-standard ANNOTATE_REACHABLE annotation order (Peter Zijlstra) - Robustify the annotation code by warning about unknown annotation types (Peter Zijlstra) - Allow arch code to discover jump table size, in preparation of annotated jump table support (Ard Biesheuvel) * tag 'objtool-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Convert unreachable() to BUG() objtool: Allow arch code to discover jump table size objtool: Warn about unknown annotation types objtool: Fix ANNOTATE_REACHABLE to be a normal annotation objtool: Convert {.UN}REACHABLE to ANNOTATE objtool: Remove annotate_{,un}reachable() loongarch: Use ASM_REACHABLE x86: Convert unreachable() to BUG() unreachable: Unify objtool: Collect more annotations in objtool.h objtool: Collapse annotate sequences objtool: Convert ANNOTATE_INTRA_FUNCTION_CALL to ANNOTATE objtool: Convert ANNOTATE_IGNORE_ALTERNATIVE to ANNOTATE objtool: Convert VALIDATE_UNRET_BEGIN to ANNOTATE objtool: Convert instrumentation_{begin,end}() to ANNOTATE objtool: Convert ANNOTATE_RETPOLINE_SAFE to ANNOTATE objtool: Convert ANNOTATE_NOENDBR to ANNOTATE objtool: Generic annotation infrastructure
2025-01-21Merge tag 'x86_misc_for_v6.14_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 updates from Borislav Petkov: - The first part of a restructuring of AMD's representation of a northbridge which is legacy now, and the creation of the new AMD node concept which represents the Zen architecture of having a collection of I/O devices within an SoC. Those nodes comprise the so-called data fabric on Zen. This has at least one practical advantage of not having to add a PCI ID each time a new data fabric PCI device releases. Eventually, the lot more uniform provider of data fabric functionality amd_node.c will be used by all the drivers which need it - Smaller cleanups * tag 'x86_misc_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/amd_node: Use defines for SMN register offsets x86/amd_node: Remove dependency on AMD_NB x86/amd_node: Update __amd_smn_rw() error paths x86/amd_nb: Move SMN access code to a new amd_node driver x86/amd_nb, hwmon: (k10temp): Simplify amd_pci_dev_to_node_id() x86/amd_nb: Simplify function 3 search x86/amd_nb: Use topology info to get AMD node count x86/amd_nb: Simplify root device search x86/amd_nb: Simplify function 4 search x86: Start moving AMD node functionality out of AMD_NB x86/amd_nb: Clean up early_is_amd_nb() x86/amd_nb: Restrict init function to AMD-based systems x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()
2025-01-21Merge tag 'x86_cpu_for_v6.14_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpuid updates from Borislav Petkov: - Remove the less generic CPU matching infra around struct x86_cpu_desc and use the generic struct x86_cpu_id thing - Remove magic naked numbers for CPUID functions and use proper defines of the prefix CPUID_LEAF_*. Consolidate some of the crazy use around the tree - Smaller cleanups and improvements * tag 'x86_cpu_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Make all all CPUID leaf names consistent x86/fpu: Remove unnecessary CPUID level check x86/fpu: Move CPUID leaf definitions to common code x86/tsc: Remove CPUID "frequency" leaf magic numbers. x86/tsc: Move away from TSC leaf magic numbers x86/cpu: Move TSC CPUID leaf definition x86/cpu: Refresh DCA leaf reading code x86/cpu: Remove unnecessary MwAIT leaf checks x86/cpu: Use MWAIT leaf definition x86/cpu: Move MWAIT leaf definition to common header x86/cpu: Remove 'x86_cpu_desc' infrastructure x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id' x86/cpu: Replace PEBS use of 'x86_cpu_desc' use with 'x86_cpu_id' x86/cpu: Expose only stepping min/max interface x86/cpu: Introduce new microcode matching helper x86/cpufeature: Document cpu_feature_enabled() as the default to use x86/paravirt: Remove the WBINVD callback x86/cpufeatures: Free up unused feature bits
2025-01-21Merge tag 'x86_sev_for_v6.14_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - A segmented Reverse Map table (RMP) is a across-nodes distributed table of sorts which contains per-node descriptors of each node-local 4K page, denoting its ownership (hypervisor, guest, etc) in the realm of confidential computing. Add support for such a table in order to improve referential locality when accessing or modifying RMP table entries - Add support for reading the TSC in SNP guests by removing any interference or influence the hypervisor might have, with the goal of making a confidential guest even more independent from the hypervisor * tag 'x86_sev_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev: Add the Secure TSC feature for SNP guests x86/tsc: Init the TSC for Secure TSC guests x86/sev: Mark the TSC in a secure TSC guest as reliable x86/sev: Prevent RDTSC/RDTSCP interception for Secure TSC enabled guests x86/sev: Prevent GUEST_TSC_FREQ MSR interception for Secure TSC enabled guests x86/sev: Change TSC MSR behavior for Secure TSC enabled guests x86/sev: Add Secure TSC support for SNP guests x86/sev: Relocate SNP guest messaging routines to common code x86/sev: Carve out and export SNP guest messaging init routines virt: sev-guest: Replace GFP_KERNEL_ACCOUNT with GFP_KERNEL virt: sev-guest: Remove is_vmpck_empty() helper x86/sev/docs: Document the SNP Reverse Map Table (RMP) x86/sev: Add full support for a segmented RMP table x86/sev: Treat the contiguous RMP table as a single RMP segment x86/sev: Map only the RMP table entries instead of the full RMP range x86/sev: Move the SNP probe routine out of the way x86/sev: Require the RMPREAD instruction after Zen4 x86/sev: Add support for the RMPREAD instruction x86/sev: Prepare for using the RMPREAD instruction to access the RMP
2025-01-21Merge tag 'x86_microcode_for_v6.14_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 microcode loader updates from Borislav Petkov: - A bunch of minor cleanups * tag 'x86_microcode_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/microcode/AMD: Remove ret local var in early_apply_microcode() x86/microcode/AMD: Have __apply_microcode_amd() return bool x86/microcode/AMD: Make __verify_patch_size() return bool x86/microcode/AMD: Remove bogus comment from parse_container() x86/microcode/AMD: Return bool from find_blobs_in_containers()
2025-01-21Merge tag 'x86_cache_for_v6.14_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 resource control updates from Borislav Petkov: - Extend resctrl with the capability of total memory bandwidth monitoring, thus accomodating systems which support only total but not local memory bandwidth monitoring. Add the respective new mount options - The usual cleanups * tag 'x86_cache_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/resctrl: Document the new "mba_MBps_event" file x86/resctrl: Add write option to "mba_MBps_event" file x86/resctrl: Add "mba_MBps_event" file to CTRL_MON directories x86/resctrl: Make mba_sc use total bandwidth if local is not supported x86/resctrl: Compute memory bandwidth for all supported events x86/resctrl: Modify update_mba_bw() to use per CTRL_MON group event x86/resctrl: Prepare for per-CTRL_MON group mba_MBps control x86/resctrl: Introduce resctrl_file_fflags_init() to initialize fflags x86/resctrl: Use kthread_run_on_cpu()
2025-01-21Merge tag 'x86_bugs_for_v6.14_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 CPU speculation update from Borislav Petkov: - Add support for AMD hardware which is not affected by SRSO on the user/kernel attack vector and advertise it to guest userspace * tag 'x86_bugs_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: KVM: x86: Advertise SRSO_USER_KERNEL_NO to userspace x86/bugs: Add SRSO_USER_KERNEL_NO support
2025-01-21Merge tag 'ras_core_for_v6.14_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS updates from Borislav Petkov: - Remove the shared threshold bank hack on AMD and streamline and simplify it - Cleanup and sanitize MCA code * tag 'ras_core_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce/amd: Remove shared threshold bank plumbing x86/mce: Remove the redundant mce_hygon_feature_init() x86/mce: Convert family/model mixed checks to VFM-based checks x86/mce: Break up __mcheck_cpu_apply_quirks() x86/mce: Make four functions return bool x86/mce/threshold: Remove the redundant this_cpu_dec_return() x86/mce: Make several functions return bool
2025-01-20x86: use cmov for user address maskingLinus Torvalds
This was a suggestion by David Laight, and while I was slightly worried that some micro-architecture would predict cmov like a conditional branch, there is little reason to actually believe any core would be that broken. Intel documents that their existing cores treat CMOVcc as a data dependency that will constrain speculation in their "Speculative Execution Side Channel Mitigations" whitepaper: "Other instructions such as CMOVcc, AND, ADC, SBB and SETcc can also be used to prevent bounds check bypass by constraining speculative execution on current family 6 processors (Intel® Core™, Intel® Atom™, Intel® Xeon® and Intel® Xeon Phi™ processors)" and while that leaves the future uarch issues open, that's certainly true of our traditional SBB usage too. Any core that predicts CMOV will be unusable for various crypto algorithms that need data-independent timing stability, so let's just treat CMOV as the safe choice that simplifies the address masking by avoiding an extra instruction and doesn't need a temporary register. Suggested-by: David Laight <David.Laight@aculab.com> Link: https://www.intel.com/content/dam/develop/external/us/en/documents/336996-speculative-execution-side-channel-mitigations.pdf Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-01-20x86: use proper 'clac' and 'stac' opcode namesLinus Torvalds
Back when we added SMAP support, all versions of binutils didn't necessarily understand the 'clac' and 'stac' instructions. So we implemented those instructions manually as ".byte" sequences. But we've since upgraded the minimum version of binutils to version 2.25, and that included proper support for the SMAP instructions, and there's no reason for us to use some line noise to express them any more. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-01-19Merge tag 'x86_urgent_for_v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Mark serialize() noinstr so that it can be used from instrumentation- free code - Make sure FRED's RSP0 MSR is synchronized with its corresponding per-CPU value in order to avoid double faults in hotplug scenarios - Disable EXECMEM_ROX on x86 for now because it didn't receive proper x86 maintainers review, went in and broke a bunch of things * tag 'x86_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asm: Make serialize() always_inline x86/fred: Fix the FRED RSP0 MSR out of sync with its per-CPU cache x86: Disable EXECMEM_ROX support
2025-01-16x86/asm: Make serialize() always_inlineJuergen Gross
In order to allow serialize() to be used from noinstr code, make it __always_inline. Fixes: 0ef8047b737d ("x86/static-call: provide a way to do very early static-call updates") Closes: https://lore.kernel.org/oe-kbuild-all/202412181756.aJvzih2K-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241218100918.22167-1-jgross@suse.com
2025-01-15genirq: Remove IRQ_MOVE_PCNTXT and related codeThomas Gleixner
Now that x86 is converted over to use the IRQCHIP_MOVE_DEFERRED flags, remove IRQ*_MOVE_PCNTXT and related code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241210103335.626707225@linutronix.de
2025-01-15x86/apic: Convert to IRQCHIP_MOVE_DEFERREDThomas Gleixner
Instead of marking individual interrupts as safe to be migrated in arbitrary contexts, mark the interrupt chips, which require the interrupt to be moved in actual interrupt context, with the new IRQCHIP_MOVE_DEFERRED flag. This makes more sense because this is a per interrupt chip property and not restricted to individual interrupts. That flips the logic from the historical opt-out to a opt-in model. This is simpler to handle for other architectures, which default to unrestricted affinity setting. It also allows to cleanup the redundant core logic significantly. All interrupt chips, which belong to a top-level domain sitting directly on top of the x86 vector domain are marked accordingly, unless the related setup code marks the interrupts with IRQ_MOVE_PCNTXT, i.e. XEN. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steve Wahl <steve.wahl@hpe.com> Acked-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/all/20241210103335.563277044@linutronix.de
2025-01-14x86/fred: Fix the FRED RSP0 MSR out of sync with its per-CPU cacheXin Li (Intel)
The FRED RSP0 MSR is only used for delivering events when running userspace. Linux leverages this property to reduce expensive MSR writes and optimize context switches. The kernel only writes the MSR when about to run userspace *and* when the MSR has actually changed since the last time userspace ran. This optimization is implemented by maintaining a per-CPU cache of FRED RSP0 and then checking that against the value for the top of current task stack before running userspace. However cpu_init_fred_exceptions() writes the MSR without updating the per-CPU cache. This means that the kernel might return to userspace with MSR_IA32_FRED_RSP0==0 when it needed to point to the top of current task stack. This would induce a double fault (#DF), which is bad. A context switch after cpu_init_fred_exceptions() can paper over the issue since it updates the cached value. That evidently happens most of the time explaining how this bug got through. Fix the bug through resynchronizing the FRED RSP0 MSR with its per-CPU cache in cpu_init_fred_exceptions(). Fixes: fe85ee391966 ("x86/entry: Set FRED RSP0 on return to userspace instead of context switch") Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250110174639.1250829-1-xin%40zytor.com
2025-01-13Merge tag 'mm-hotfixes-stable-2025-01-13-00-03' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "18 hotfixes. 11 are cc:stable. 13 are MM and 5 are non-MM. All patches are singletons - please see the relevant changelogs for details" * tag 'mm-hotfixes-stable-2025-01-13-00-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: fs/proc: fix softlockup in __read_vmcore (part 2) mm: fix assertion in folio_end_read() mm: vmscan : pgdemote vmstat is not getting updated when MGLRU is enabled. vmstat: disable vmstat_work on vmstat_cpu_down_prep() zram: fix potential UAF of zram table selftests/mm: set allocated memory to non-zero content in cow test mm: clear uffd-wp PTE/PMD state on mremap() module: fix writing of livepatch relocations in ROX text mm: zswap: properly synchronize freeing resources during CPU hotunplug Revert "mm: zswap: fix race between [de]compression and CPU hotunplug" hugetlb: fix NULL pointer dereference in trace_hugetlbfs_alloc_inode mm: fix div by zero in bdi_ratio_from_pages x86/execmem: fix ROX cache usage in Xen PV guests filemap: avoid truncating 64-bit offset to 32 bits tools: fix atomic_set() definition to set the value correctly mm/mempolicy: count MPOL_WEIGHTED_INTERLEAVE to "interleave_hit" scripts/decode_stacktrace.sh: fix decoding of lines with an additional info mm/kmemleak: fix percpu memory leak detection failure
2025-01-13x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionallyK Prateek Nayak
x86_sched_itmt_flags() returns SD_ASYM_PACKING if ITMT support is enabled by the system. Without ITMT support being enabled, it returns 0 similar to current x86_die_flags() on non-Hybrid systems (!X86_HYBRID_CPU and !X86_FEATURE_AMD_HETEROGENEOUS_CORES) On Intel systems that enable ITMT support, either the MC domain coincides with the PKG domain, or in case of multiple MC groups within a PKG domain, either Sub-NUMA Cluster (SNC) is enabled or the processor features Hybrid core layout (X86_HYBRID_CPU) which leads to three distinct possibilities: o If PKG and MC domains coincide, PKG domain is degenerated by sd_parent_degenerate() when building sched domain topology. o If SNC is enabled, PKG domain is never added since "x86_has_numa_in_package" is set and the topology will instead contain NODE and NUMA domains. o On X86_HYBRID_CPU which contains multiple MC groups within the PKG, the PKG domain requires x86_sched_itmt_flags(). Thus, on Intel systems that contains multiple MC groups within the PKG and enables ITMT support, the PKG domain requires x86_sched_itmt_flags(). In all other cases PKG domain is either never added or is degenerated. Thus, returning x86_sched_itmt_flags() unconditionally at PKG domain on Intel systems should not lead to any functional changes. On AMD systems with multiple LLCs (MC groups) within a PKG domain, enabling ITMT support requires setting SD_ASYM_PACKING to the PKG domain since the core rankings are assigned PKG-wide. Core rankings on AMD processors is currently set by the amd-pstate driver when Preferred Core feature is supported. A subset of systems that support Preferred Core feature can be detected using X86_FEATURE_AMD_HETEROGENEOUS_CORES however, this does not cover all the systems that support Preferred Core ranking. Detecting Preferred Core support on AMD systems requires inspecting CPPC Highest Perf on all present CPUs and checking if it differs on at least one CPU. Previous suggestion to use a synthetic feature to detect Preferred Core support [1] was found to be non-trivial to implement since BSP alone cannot detect if Preferred Core is supported and by the time AP comes up, alternatives are patched and setting a X86_FEATURE_* then is not possible. Since x86 processors enabling ITMT support that consists multiple non-NUMA MC groups within a PKG requires SD_ASYM_PACKING flag set at the PKG domain, return x86_sched_itmt_flags unconditionally for the PKG domain. Since x86_die_flags() would have just returned x86_sched_itmt_flags() after the change, remove the unnecessary wrapper and pass x86_sched_itmt_flags() directly as the flags function. Fixes: f3a052391822 ("cpufreq: amd-pstate: Enable amd-pstate preferred core support") Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-6-kprateek.nayak@amd.com
2025-01-13x86/topology: Remove x86_smt_flags and use cpu_smt_flags directlyK Prateek Nayak
x86_*_flags() wrappers were introduced with commit d3d37d850d1d ("x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU") to add x86_sched_itmt_flags() in addition to the default domain flags for SMT and MC domain. commit 995998ebdebd ("x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags") removed the ITMT flags for SMT domain but not the x86_smt_flags() wrappers which directly returns cpu_smt_flags(). Remove x86_smt_flags() and directly use cpu_smt_flags() to derive the flags for SMT domain. No functional changes intended. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-5-kprateek.nayak@amd.com
2025-01-13x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfsK Prateek Nayak
"sched_itmt_enabled" was only introduced as a debug toggle for any funky ITMT behavior. Move the sysctl controlled from "/proc/sys/kernel/sched_itmt_enabled" to debugfs at "/sys/kernel/debug/x86/sched_itmt_enabled" with a notable change that a cat on the file will return "Y" or "N" instead of "1" or "0" to indicate that feature is enabled or disabled respectively. Either "0" or "N" (or any string that kstrtobool() interprets as false) can be written to the file will disable the feature, and writing either "1" or "Y" (or any string that kstrtobool() interprets as true) will enable it back when the platform supports ITMT ranking. Since ITMT is x86 specific (and PowerPC uses SD_ASYM_PACKING too), the toggle was moved to "/sys/kernel/debug/x86/" as opposed to "/sys/kernel/debug/sched/" Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-4-kprateek.nayak@amd.com
2025-01-13x86/itmt: Use guard() for itmt_update_mutexK Prateek Nayak
Use guard() for itmt_update_mutex which avoids the extra mutex_unlock() in the bailout and return paths. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-3-kprateek.nayak@amd.com
2025-01-13x86/itmt: Convert "sysctl_sched_itmt_enabled" to booleanK Prateek Nayak
In preparation to move "sysctl_sched_itmt_enabled" to debugfs, convert the unsigned int to bool since debugfs readily exposes boolean fops primitives (debugfs_read_file_bool, debugfs_write_file_bool) which can streamline the conversion. Since the current ctl_table initializes extra1 and extra2 to SYSCTL_ZERO and SYSCTL_ONE respectively, the value of "sysctl_sched_itmt_enabled" can only be 0 or 1 and this datatype conversion should not cause any functional changes. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-2-kprateek.nayak@amd.com
2025-01-13x86: Disable EXECMEM_ROX supportPeter Zijlstra
The whole module_writable_address() nonsense made a giant mess of alternative.c, not to mention it still contains bugs -- notable some of the CFI variants crash and burn. Mike has been working on patches to clean all this up again, but given the current state of things, this stuff just isn't ready. Disable for now, lets try again next cycle. Fixes: 5185e7f9f3bd ("x86/module: enable ROX caches for module text on 64 bit") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250113112934.GA8385@noisy.programming.kicks-ass.net
2025-01-12x86/execmem: fix ROX cache usage in Xen PV guestsJuergen Gross
The recently introduced ROX cache for modules is assuming large page support in 64-bit mode without testing the related feature bit. This results in breakage when running as a Xen PV guest, as in this mode large pages are not supported. Fix that by testing the X86_FEATURE_PSE capability when deciding whether to enable the ROX cache. Link: https://lkml.kernel.org/r/20250103065631.26459-1-jgross@suse.com Fixes: 2e45474ab14f ("execmem: add support for cache of large ROX pages") Signed-off-by: Juergen Gross <jgross@suse.com> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-10perf/x86/intel/uncore: Support more units on Granite RapidsKan Liang
The same CXL PMONs support is also avaiable on GNR. Apply spr_uncore_cxlcm and spr_uncore_cxldp to GNR as well. The other units were broken on early HW samples, so they were ignored in the early enabling patch. The issue has been fixed and verified on the later production HW. Add UPI, B2UPI, B2HOT, PCIEX16 and PCIEX8 for GNR. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Eric Hu <eric.hu@intel.com> Link: https://lkml.kernel.org/r/20250108143017.1793781-2-kan.liang@linux.intel.com
2025-01-10perf/x86/intel/uncore: Clean up func_idKan Liang
The below warning may be triggered on GNR when the PCIE uncore units are exposed. WARNING: CPU: 4 PID: 1 at arch/x86/events/intel/uncore.c:1169 uncore_pci_pmu_register+0x158/0x190 The current uncore driver assumes that all the devices in the same PMU have the exact same devfn. It's true for the previous platforms. But it doesn't work for the new PCIE uncore units on GNR. The assumption doesn't make sense. There is no reason to limit the devices from the same PMU to the same devfn. Also, the current code just throws the warning, but still registers the device. The WARN_ON_ONCE() should be removed. The func_id is used by the later event_init() to check if a event->pmu has valid devices. For cpu and mmio uncore PMUs, they are always valid. For pci uncore PMUs, it's set when the PMU is registered. It can be replaced by the pmu->registered. Clean up the func_id. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Eric Hu <eric.hu@intel.com> Link: https://lkml.kernel.org/r/20250108143017.1793781-1-kan.liang@linux.intel.com
2025-01-09x86/sev: Add the Secure TSC feature for SNP guestsNikunj A Dadhania
Now that all the required plumbing is done for enabling Secure TSC, add it to the SNP features present list. Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Peter Gonda <pgonda@google.com> Link: https://lore.kernel.org/r/20250106124633.1418972-14-nikunj@amd.com
2025-01-08x86/tsc: Init the TSC for Secure TSC guestsNikunj A Dadhania
Use the GUEST_TSC_FREQ MSR to discover the TSC frequency instead of relying on kvm-clock based frequency calibration. Override both CPU and TSC frequency calibration callbacks with securetsc_get_tsc_khz(). Since the difference between CPU base and TSC frequency does not apply in this case, the same callback is being used. [ bp: Carve out from https://lore.kernel.org/r/20250106124633.1418972-11-nikunj@amd.com ] Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250106124633.1418972-11-nikunj@amd.com
2025-01-08treewide: Introduce kthread_run_worker[_on_cpu]()Frederic Weisbecker
kthread_create() creates a kthread without running it yet. kthread_run() creates a kthread and runs it. On the other hand, kthread_create_worker() creates a kthread worker and runs it. This difference in behaviours is confusing. Also there is no way to create a kthread worker and affine it using kthread_bind_mask() or kthread_affine_preferred() before starting it. Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]() that behaves just like kthread_run(). kthread_create_worker[_on_cpu]() will now only create a kthread worker without starting it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
2025-01-08x86/amd_node: Use defines for SMN register offsetsYazen Ghannam
There are more than one SMN index/data pair available for software use. The register offsets are different, but the protocol is the same. Use defines for the SMN offset values and allow the index/data offsets to be passed to the read/write helper function. This eases code reuse with other SMN users in the kernel. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241206161210.163701-14-yazen.ghannam@amd.com
2025-01-08x86/amd_node: Remove dependency on AMD_NBYazen Ghannam
Cache the root devices locally so that there are no more dependencies on AMD_NB. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241206161210.163701-13-yazen.ghannam@amd.com
2025-01-08x86/amd_node: Update __amd_smn_rw() error pathsYazen Ghannam
Use guard(mutex) and convert PCI error codes to common ones. Suggested-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241206161210.163701-12-yazen.ghannam@amd.com
2025-01-08x86/amd_nb: Move SMN access code to a new amd_node driverMario Limonciello
SMN access was bolted into amd_nb mostly as convenience. This has limitations though that require incurring tech debt to keep it working. Move SMN access to the newly introduced AMD Node driver. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> # pdx86 Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> # PMF, PMC Link: https://lore.kernel.org/r/20241206161210.163701-11-yazen.ghannam@amd.com
2025-01-08x86/amd_nb, hwmon: (k10temp): Simplify amd_pci_dev_to_node_id()Mario Limonciello
amd_pci_dev_to_node_id() tries to find the AMD node ID of a device by searching and counting devices. The AMD node ID of an AMD node device is simply its slot number minus the AMD node 0 slot number. Simplify this function and move it to k10temp.c. [ Yazen: Update commit message and simplify function. ] Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20241206161210.163701-10-yazen.ghannam@amd.com
2025-01-08x86/amd_nb: Simplify function 3 searchYazen Ghannam
Use the newly introduced helper function to look up "function 3". Drop unused PCI IDs and code. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250107222847.3300430-8-yazen.ghannam@amd.com
2025-01-08x86/amd_nb: Use topology info to get AMD node countYazen Ghannam
Currently, the total AMD node count is determined by searching and counting CPU/node devices using PCI IDs. However, AMD node information is already available through topology CPUID/MSRs. The recent topology rework has made this info easier to access. Replace the node counting code with a simple product of topology info. Every node/northbridge is expected to have a 'misc' device. Clear everything out if a 'misc' device isn't found on a node. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250107222847.3300430-7-yazen.ghannam@amd.com
2025-01-08x86/amd_nb: Simplify root device searchYazen Ghannam
The "root" device search was introduced to support SMN access for Zen systems. This device represents a PCIe root complex. It is not the same as the "CPU/node" devices found at slots 0x18-0x1F. There may be multiple PCIe root complexes within an AMD node. Such is the case with server or High-end Desktop (HEDT) systems, etc. Therefore it is not enough to assume "root <-> AMD node" is a 1-to-1 association. Currently, this is handled by skipping "extra" root complexes during the search. However, the hardware provides the PCI bus number of an AMD node's root device. Use the hardware info to get the root device's bus and drop the extra search code and PCI IDs. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241206161210.163701-7-yazen.ghannam@amd.com
2025-01-08x86/amd_nb: Simplify function 4 searchYazen Ghannam
Use the newly added helper function to look up a CPU/Node function to find "function 4" devices. Thus, avoid the need to regularly add new PCI IDs for basic discovery. The unique PCI IDs are still useful in case of quirks or functional changes. And they should be used only in such a manner. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241206161210.163701-6-yazen.ghannam@amd.com
2025-01-08x86: Start moving AMD node functionality out of AMD_NBYazen Ghannam
The "AMD Node" concept spans many families of systems and applies to a number of subsystems and drivers. Currently, the AMD Northbridge code is overloaded with AMD node functionality. However, the node concept is broader than just northbridges. Start files to host common AMD node functions and definitions. Include a helper to find an AMD node device function based on the convention described in AMD documentation. Anything that needs node functionality should include this rather than amd_nb.h. The AMD_NB code will be reduced to only northbridge-specific code needed for legacy systems. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241206161210.163701-5-yazen.ghannam@amd.com
2025-01-08x86/amd_nb: Clean up early_is_amd_nb()Yazen Ghannam
The check for early_is_amd_nb() is only useful for systems with GART or the NB_CFG register. Zen-based systems (both AMD and Hygon) have neither, so return early for them. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241206161210.163701-4-yazen.ghannam@amd.com
2025-01-08x86/amd_nb: Restrict init function to AMD-based systemsYazen Ghannam
The code implicitly operates on AMD-based systems by matching on PCI IDs. However, the use of these IDs is going away. Add an explicit CPU vendor check instead of relying on PCI IDs. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241206161210.163701-3-yazen.ghannam@amd.com
2025-01-07x86/fpu: Ensure shadow stack is active before "getting" registersRick Edgecombe
The x86 shadow stack support has its own set of registers. Those registers are XSAVE-managed, but they are "supervisor state components" which means that userspace can not touch them with XSAVE/XRSTOR. It also means that they are not accessible from the existing ptrace ABI for XSAVE state. Thus, there is a new ptrace get/set interface for it. The regset code that ptrace uses provides an ->active() handler in addition to the get/set ones. For shadow stack this ->active() handler verifies that shadow stack is enabled via the ARCH_SHSTK_SHSTK bit in the thread struct. The ->active() handler is checked from some call sites of the regset get/set handlers, but not the ptrace ones. This was not understood when shadow stack support was put in place. As a result, both the set/get handlers can be called with XFEATURE_CET_USER in its init state, which would cause get_xsave_addr() to return NULL and trigger a WARN_ON(). The ssp_set() handler luckily has an ssp_active() check to avoid surprising the kernel with shadow stack behavior when the kernel is not ready for it (ARCH_SHSTK_SHSTK==0). That check just happened to avoid the warning. But the ->get() side wasn't so lucky. It can be called with shadow stacks disabled, triggering the warning in practice, as reported by Christina Schimpe: WARNING: CPU: 5 PID: 1773 at arch/x86/kernel/fpu/regset.c:198 ssp_get+0x89/0xa0 [...] Call Trace: <TASK> ? show_regs+0x6e/0x80 ? ssp_get+0x89/0xa0 ? __warn+0x91/0x150 ? ssp_get+0x89/0xa0 ? report_bug+0x19d/0x1b0 ? handle_bug+0x46/0x80 ? exc_invalid_op+0x1d/0x80 ? asm_exc_invalid_op+0x1f/0x30 ? __pfx_ssp_get+0x10/0x10 ? ssp_get+0x89/0xa0 ? ssp_get+0x52/0xa0 __regset_get+0xad/0xf0 copy_regset_to_user+0x52/0xc0 ptrace_regset+0x119/0x140 ptrace_request+0x13c/0x850 ? wait_task_inactive+0x142/0x1d0 ? do_syscall_64+0x6d/0x90 arch_ptrace+0x102/0x300 [...] Ensure that shadow stacks are active in a thread before looking them up in the XSAVE buffer. Since ARCH_SHSTK_SHSTK and user_ssp[SHSTK_EN] are set at the same time, the active check ensures that there will be something to find in the XSAVE buffer. [ dhansen: changelog/subject tweaks ] Fixes: 2fab02b25ae7 ("x86: Add PTRACE interface for shadow stack") Reported-by: Christina Schimpe <christina.schimpe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Christina Schimpe <christina.schimpe@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250107233056.235536-1-rick.p.edgecombe%40intel.com
2025-01-07x86/sev: Mark the TSC in a secure TSC guest as reliableNikunj A Dadhania
In SNP guest environment with Secure TSC enabled, unlike other clock sources (such as HPET, ACPI timer, APIC, etc), the RDTSC instruction is handled without causing a VM exit, resulting in minimal overhead and jitters. Even when the host CPU's TSC is tampered with, the Secure TSC enabled guest keeps on ticking forward. Hence, mark Secure TSC as the only reliable clock source, bypassing unstable calibration. [ bp: Massage. ] Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Peter Gonda <pgonda@google.com> Link: https://lore.kernel.org/r/20250106124633.1418972-10-nikunj@amd.com
2025-01-07x86/sev: Prevent RDTSC/RDTSCP interception for Secure TSC enabled guestsNikunj A Dadhania
The hypervisor should not be intercepting RDTSC/RDTSCP when Secure TSC is enabled. A #VC exception will be generated if the RDTSC/RDTSCP instructions are being intercepted. If this should occur and Secure TSC is enabled, guest execution should be terminated as the guest cannot rely on the TSC value provided by the hypervisor. Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Peter Gonda <pgonda@google.com> Link: https://lore.kernel.org/r/20250106124633.1418972-9-nikunj@amd.com
2025-01-07x86/sev: Prevent GUEST_TSC_FREQ MSR interception for Secure TSC enabled guestsNikunj A Dadhania
The hypervisor should not be intercepting GUEST_TSC_FREQ MSR(0xcOO10134) when Secure TSC is enabled. A #VC exception will be generated otherwise. If this should occur and Secure TSC is enabled, terminate guest execution. Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250106124633.1418972-8-nikunj@amd.com