path: root/arch
2024-12-18  KVM: x86: Unpack F() CPUID feature flag macros to one flag per line of code  (Sean Christopherson)

Refactor kvm_set_cpu_caps() to express each supported (or not) feature flag on a separate line, modulo a handful of cases where KVM does not, and likely will not, support a sequence of flags. This will allow adding fancier macros with longer, more descriptive names without resulting in absurd line lengths and/or weird code. Isolating each flag also makes it far easier to review changes, reduces code conflicts, and generally makes it easier to resolve conflicts. Lastly, it allows co-locating comments for notable flags, e.g. MONITOR, precisely with the relevant flag.

No functional change intended.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
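A minimal sketch of the style this change moves toward (flag names and grouping are illustrative, not the actual kvm_set_cpu_caps() contents):

  /* Before: flags packed several to a line. */
  kvm_cpu_cap_mask(CPUID_1_ECX,
      F(XMM3) | F(PCLMULQDQ) | F(SSSE3) | F(FMA) | F(CX16)
  );

  /* After: one flag per line, leaving room for per-flag comments. */
  kvm_cpu_cap_mask(CPUID_1_ECX,
      F(XMM3) |
      F(PCLMULQDQ) |
      F(SSSE3) |
      F(FMA) |
      F(CX16) |
      /* A notable flag's comment can now sit right next to the flag. */
      F(MONITOR)
  );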
2024-12-18  KVM: x86: Account for max supported CPUID leaf when getting raw host CPUID  (Sean Christopherson)

Explicitly zero out the feature word in kvm_cpu_caps if the word's associated CPUID function is greater than the max leaf supported by the CPU. For such unsupported functions, Intel CPUs return the output from the last supported leaf, not all zeros.

Practically speaking, this is likely a benign bug, as KVM uses the raw host CPUID to mask the kernel's computed capabilities, and the kernel does perform max leaf checks when populating boot_cpu_data. The only way KVM's goof could be problematic is if the kernel force-set a feature in a leaf that is completely unsupported, _and_ the max supported leaf happened to return a value with '1' in the same bit position. Which is theoretically possible, but extremely unlikely. And even if that did happen, it's entirely possible that KVM would still provide the correct functionality; the kernel did set the capability after all.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
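A hedged sketch of the idea (cpuid_eax() is the real kernel helper; the rest of the names are illustrative):

  /*
   * Treat feature words whose CPUID function is above the host's max
   * supported leaf as all-zero, since Intel CPUs echo the output of
   * the last in-range leaf instead of returning zeros.
   */
  static u32 raw_cpuid_word(u32 function, u32 index, int reg)
  {
      u32 base = function & 0x80000000;   /* basic vs. extended range */
      u32 max_leaf = cpuid_eax(base);

      if (function > max_leaf)
          return 0;

      return host_cpuid_reg(function, index, reg);  /* hypothetical helper */
  }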
2024-12-18  KVM: x86: Do reverse CPUID sanity checks in __feature_leaf()  (Sean Christopherson)

Do the compile-time sanity checks on reverse_cpuid in __feature_leaf() so that higher level APIs don't need to "manually" perform the sanity checks.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
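A minimal sketch of centralizing the checks (the specific assertions are illustrative; the real set lives alongside the reverse_cpuid table):

  static __always_inline u32 __feature_leaf(int x86_feature)
  {
      u32 leaf = x86_feature / 32;

      /* Compile-time checks done here once, not repeated by callers. */
      BUILD_BUG_ON(leaf >= ARRAY_SIZE(reverse_cpuid));

      return leaf;
  }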
2024-12-18  KVM: x86: Don't update PV features caches when enabling enforcement capability  (Sean Christopherson)

Revert the chunk of commit 01b4f510b9f4 ("kvm: x86: ensure pv_cpuid.features is initialized when enabling cap") that forced a PV features cache refresh during KVM_CAP_ENFORCE_PV_FEATURE_CPUID, as whatever ioctl() ordering issue it alleged to have fixed never existed upstream, and likely never existed in any kernel.

At the time of the commit, there was a tangentially related ioctl() ordering issue, as toggling KVM_X86_DISABLE_EXITS_HLT after KVM_SET_CPUID2 would have resulted in KVM potentially leaving KVM_FEATURE_PV_UNHALT set. But (a) that bug affected the entire guest CPUID, not just the cache, (b) commit 01b4f510b9f4 didn't address that bug, it only refreshed the cache (with the bad CPUID), and (c) setting KVM_X86_DISABLE_EXITS_HLT after vCPU creation is completely broken as KVM configures HLT-exiting only during vCPU creation, which is why KVM_CAP_X86_DISABLE_EXITS is now disallowed if vCPUs have been created.

Another tangentially related bug was KVM's failure to clear the cache when handling KVM_SET_CPUID2, but again commit 01b4f510b9f4 did nothing to fix that bug.

The most plausible explanation for what commit 01b4f510b9f4 was trying to fix is a bug that existed in Google's internal kernel that was the source of commit 01b4f510b9f4. At the time, Google's internal kernel had not yet picked up commit 0d3b2ba16ba68 ("KVM: X86: Go on updating other CPUID leaves when leaf 1 is absent"), i.e. KVM would not initialize the PV features cache if KVM_SET_CPUID2 was called without a CPUID.0x1 entry. Of course, no sane real world VMM would omit CPUID.0x1, including the KVM selftest added by commit ac4a4d6de22e ("selftests: kvm: test enforcement of paravirtual cpuid features"). And the test didn't actually try to verify multiple orderings, nor did the selftest enter the guest without doing KVM_SET_CPUID2, so who knows what motivated the change.

Regardless of why commit 01b4f510b9f4 ("kvm: x86: ensure pv_cpuid.features is initialized when enabling cap") was added, refreshing the cache during KVM_CAP_ENFORCE_PV_FEATURE_CPUID isn't necessary.

Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18  KVM: x86: Zero out PV features cache when the CPUID leaf is not present  (Sean Christopherson)

Clear KVM's PV features cache when processing a new guest CPUID so that KVM doesn't keep a stale cache entry if userspace does KVM_SET_CPUID2 multiple times, once with a PV features entry, and a second time without.

Fixes: 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
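A minimal sketch, assuming the cache is a per-vCPU feature word (field and lookup names illustrative):

  static void kvm_update_pv_cpuid_cache(struct kvm_vcpu *vcpu)
  {
      struct kvm_cpuid_entry2 *best;

      /* Reset first so a missing leaf leaves the cache empty, not stale. */
      vcpu->arch.pv_cpuid.features = 0;

      best = kvm_find_kvm_cpuid_features(vcpu);   /* hypothetical lookup */
      if (best)
          vcpu->arch.pv_cpuid.features = best->eax;
  }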
2024-12-18  KVM: x86: Reject disabling of MWAIT/HLT interception when not allowed  (Sean Christopherson)

Reject KVM_CAP_X86_DISABLE_EXITS if userspace attempts to disable MWAIT or HLT exits and KVM previously reported (via KVM_CHECK_EXTENSION) that disabling the exit(s) is not allowed. E.g. because MWAIT isn't supported or the CPU doesn't have an always-running APIC timer, or because KVM is configured to mitigate cross-thread vulnerabilities.

Cc: Kechen Lu <kechenl@nvidia.com>
Fixes: 4d5422cea3b6 ("KVM: X86: Provide a capability to disable MWAIT intercepts")
Fixes: 6f0f2d5ef895 ("KVM: x86: Mitigate the cross-thread return address predictions bug")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
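A hedged sketch of the check (the helper computing the allowed set is illustrative):

  /* Fail if userspace asks to disable exits that KVM_CHECK_EXTENSION
   * never advertised as disable-able on this system. */
  if (cap->args[0] & ~kvm_get_allowed_disable_exits())   /* hypothetical */
      return -EINVAL;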
2024-12-18  KVM: x86: Disallow KVM_CAP_X86_DISABLE_EXITS after vCPU creation  (Sean Christopherson)

Reject KVM_CAP_X86_DISABLE_EXITS if vCPUs have been created, as disabling PAUSE/MWAIT/HLT exits after vCPUs have been created is broken and useless, e.g. except for PAUSE on SVM, the relevant intercepts aren't updated after vCPU creation. vCPUs may also end up with an inconsistent configuration if exits are disabled between creation of multiple vCPUs.

Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/all/9227068821b275ac547eb2ede09ec65d2281fe07.1680179693.git.houwenlong.hwl@antgroup.com
Link: https://lore.kernel.org/all/20230121020738.2973-2-kechenl@nvidia.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
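A minimal sketch of enforcing the ordering, assuming the standard kvm->lock / created_vcpus pattern (the field receiving the setting is illustrative):

  mutex_lock(&kvm->lock);
  if (kvm->created_vcpus) {
      r = -EINVAL;    /* too late: intercepts are set at vCPU creation */
  } else {
      kvm->arch.disabled_exits = cap->args[0];    /* illustrative field */
      r = 0;
  }
  mutex_unlock(&kvm->lock);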
2024-12-18  KVM: x86: Drop now-redundant MAXPHYADDR and GPA rsvd bits from vCPU creation  (Sean Christopherson)

Drop the manual initialization of maxphyaddr and reserved_gpa_bits during vCPU creation now that kvm_arch_vcpu_create() unconditionally invokes kvm_vcpu_after_set_cpuid(), which handles all such CPUID caching.

None of the helpers between the existing code in kvm_arch_vcpu_create() and the call to kvm_vcpu_after_set_cpuid() consume maxphyaddr or reserved_gpa_bits (though auditing vmx_vcpu_create() and svm_vcpu_create() isn't exactly easy).

Link: https://lore.kernel.org/r/20241128013424.4096668-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18  KVM: x86/pmu: Drop now-redundant refresh() during init()  (Sean Christopherson)

Drop the manual kvm_pmu_refresh() from kvm_pmu_init() now that kvm_arch_vcpu_create() performs the refresh via kvm_vcpu_after_set_cpuid().

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18  KVM: x86: Move __kvm_is_valid_cr4() definition to x86.h  (Sean Christopherson)

Let vendor code inline __kvm_is_valid_cr4() now that x86.c's cr4_reserved_bits no longer exists, as keeping cr4_reserved_bits local to x86.c was the only reason for "hiding" the definition of __kvm_is_valid_cr4().

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18  KVM: x86: Account for KVM-reserved CR4 bits when passing through CR4 on VMX  (Sean Christopherson)

Drop x86.c's local pre-computed cr4_reserved_bits and instead fold KVM's reserved bits into the guest's reserved bits. This fixes a bug where VMX's set_cr4_guest_host_mask() fails to account for KVM-reserved bits when deciding which bits can be passed through to the guest. In most cases, letting the guest directly write reserved CR4 bits is ok, i.e. attempting to set the bit(s) will still #GP, but not if a feature is available in hardware but explicitly disabled by the host, e.g. if FSGSBASE support is disabled via "nofsgsbase".

Note, the extra overhead of computing host reserved bits every time userspace sets guest CPUID is negligible. The feature bits that are queried are packed nicely into a handful of words, and so checking and setting each reserved bit costs in the neighborhood of ~5 cycles, i.e. the total cost will be in the noise even if the number of checked CR4 bits doubles over the next few years. In other words, x86 will run out of CR4 bits long before the overhead becomes problematic.

Note #2, __cr4_reserved_bits() starts from CR4_RESERVED_BITS, which is why the existing __kvm_cpu_cap_has() processing doesn't explicitly OR in CR4_RESERVED_BITS (and why the new code doesn't do so either).

Fixes: 2ed41aa631fc ("KVM: VMX: Intercept guest reserved CR4 bits to inject #GP fault")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
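A hedged sketch of the fold, following the two-argument __cr4_reserved_bits(<predicate>, <arg>) shape mentioned above (the adapter macro is illustrative):

  /* Adapter so the host-capability check fits the two-arg macro shape. */
  #define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)

  /* Reserved per guest CPUID, plus anything the host lacks or disabled,
   * so set_cr4_guest_host_mask() never passes such a bit through. */
  vcpu->arch.cr4_guest_rsvd_bits =
      __cr4_reserved_bits(guest_cpuid_has, vcpu) |
      __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);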
2024-12-18  KVM: x86: Explicitly do runtime CPUID updates "after" initial setup  (Sean Christopherson)

Explicitly perform runtime CPUID adjustments as part of the "after set CPUID" flow to guard against bugs where KVM consumes stale vCPU/CPUID state during kvm_update_cpuid_runtime(). E.g. see commit 4736d85f0d18 ("KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT").

Whacking each mole individually is not sustainable or robust; e.g. while the aforementioned commit fixed KVM's PV features, the same issue lurks for Xen and Hyper-V features. Xen and Hyper-V simply don't have any runtime features (though spoiler alert, neither should KVM).

Updating runtime features in the "full" path will also simplify adding a snapshot of the guest's capabilities, i.e. of caching the intersection of guest CPUID and kvm_cpu_caps (modulo a few edge cases).

Link: https://lore.kernel.org/r/20241128013424.4096668-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18  KVM: x86: Do all post-set CPUID processing during vCPU creation  (Sean Christopherson)

During vCPU creation, process KVM's default, empty CPUID as if userspace set an empty CPUID to ensure consistent and correct behavior with respect to guest CPUID. E.g. if userspace never sets guest CPUID, KVM will never configure cr4_guest_rsvd_bits, and thus create divergent, incorrect, guest-visible behavior due to letting the guest set any KVM-supported CR4 bits despite the features not being allowed per guest CPUID.

Note! This changes KVM's ABI, as lack of full CPUID processing allowed userspace to stuff garbage vCPU state, e.g. userspace could set CR4 to a guest-unsupported value via KVM_SET_SREGS. But it's extremely unlikely that this is a breaking change, as KVM already has many flows that require userspace to set guest CPUID before loading vCPU state. E.g. multiple MSR flows consult guest CPUID on host writes, and KVM_SET_SREGS itself already relies on guest CPUID being up-to-date, as KVM's validity check on CR3 consumes CPUID.0x7.1 (for LAM) and CPUID.0x80000008 (for MAXPHYADDR).

Furthermore, the plan is to commit to enforcing guest CPUID for userspace writes to MSRs, at which point bypassing sregs CPUID checks is even more nonsensical.

Link: https://lore.kernel.org/r/20241128013424.4096668-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
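A hedged sketch of the creation-time flow; whether an empty set goes through the exact same setter as userspace is an assumption here:

  int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
  {
      /* ... */

      /* Run the full "after set CPUID" pipeline on an empty CPUID so
       * cr4_guest_rsvd_bits et al. are always initialized. */
      kvm_set_cpuid(vcpu, NULL, 0);

      /* ... */
  }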
2024-12-18  KVM: x86: Limit use of F() and SF() to kvm_cpu_cap_{mask,init_kvm_defined}()  (Sean Christopherson)

Define and undefine the F() and SF() macros precisely around kvm_set_cpu_caps() to make it all but impossible to use the macros outside of kvm_cpu_cap_{mask,init_kvm_defined}(). Currently, F() is a simple passthrough, but SF() is actively dangerous as it checks that the scattered feature is supported by the host kernel. And usage outside of the aforementioned helpers will run afoul of future changes to harden KVM's CPUID management.

Opportunistically switch to feature_bit() when stuffing LA57 based on raw hardware support.

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
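A minimal sketch of the scoping; the SF() body matches the semantics described above but is still illustrative:

  #define F(name)   feature_bit(name)
  /* Scattered flags are advertised only if the host kernel has them. */
  #define SF(name)  (boot_cpu_has(X86_FEATURE_##name) ? F(name) : 0)

  void kvm_set_cpu_caps(void)
  {
      /* ... the only legitimate user of F() and SF() ... */
  }

  #undef F
  #undef SF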
2024-12-18  KVM: x86: Use feature_bit() to clear CONSTANT_TSC when emulating CPUID  (Sean Christopherson)

When clearing CONSTANT_TSC during CPUID emulation due to a Hyper-V quirk, use feature_bit() instead of SF() to ensure the bit is actually cleared. SF() evaluates to zero if the _host_ doesn't support the feature. I.e. KVM could keep the bit set if userspace advertised CONSTANT_TSC despite it not being supported in hardware.

Note, translating from a scattered feature to the hardware version is done by __feature_translate(), not SF(). The sole purpose of SF() is to check kernel support for the scattered feature, *before* translation.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
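A hedged sketch of the difference (the destination register is illustrative):

  /* Buggy: SF(CONSTANT_TSC) is 0 when the host lacks the feature, so
   * '&= ~SF(...)' degenerates to '&= ~0' and clears nothing. */
  entry->edx &= ~SF(CONSTANT_TSC);

  /* Fixed: feature_bit() always yields the real mask. */
  entry->edx &= ~feature_bit(CONSTANT_TSC);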
2024-12-18  KVM: arm64: Only apply PMCR_EL0.P to the guest range of counters  (Oliver Upton)

An important distinction from other registers affected by HPMN is that PMCR_EL0 only affects the guest range of counters, regardless of the EL from which it is accessed. Ensure that PMCR_EL0.P is always applied to 'guest' counters by manually computing the mask rather than deriving it from the current context.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175611.3658290-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
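A hedged sketch of the manual mask computation (the helpers here are assumptions, not the actual arm64 KVM API):

  if (val & ARMV8_PMU_PMCR_P) {
      /* PMCR_EL0.P resets only the guest's counters, [0, HPMN),
       * regardless of the EL from which the write arrived. */
      unsigned long mask = GENMASK(kvm_pmu_hpmn(vcpu) - 1, 0);

      kvm_pmu_reset_counters(vcpu, mask);   /* hypothetical helper */
  }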
2024-12-18  KVM: arm64: nv: Reload PMU events upon MDCR_EL2.HPME change  (Oliver Upton)

MDCR_EL2.HPME is the 'global' enable bit for event counters reserved for EL2. Give the PMU a kick when it's changed to ensure events are reprogrammed before returning to the guest.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175550.3658212-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
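A minimal sketch of the kick; KVM_REQ_RELOAD_PMU is the request named by the next entry, while the surrounding variables are illustrative:

  /* MDCR_EL2.HPME gates the EL2-reserved counters; reprogram events
   * before the next guest entry when it flips. */
  if ((old_mdcr ^ new_mdcr) & MDCR_EL2_HPME)
      kvm_make_request(KVM_REQ_RELOAD_PMU, vcpu);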
2024-12-18  KVM: arm64: Use KVM_REQ_RELOAD_PMU to handle PMCR_EL0.E change  (Oliver Upton)

Nested virt introduces yet another set of 'global' knobs for controlling event counters that are reserved for EL2 (i.e. >= HPMN). Get ready to share some plumbing with the NV controls by offloading counter reprogramming to KVM_REQ_RELOAD_PMU.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175532.3658134-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-12-18  KVM: arm64: Add unified helper for reprogramming counters by mask  (Oliver Upton)

Having separate helpers for enabling/disabling counters provides the wrong abstraction, as the state of each counter needs to be evaluated independently and, in some cases, use a different global enable bit.

Collapse the enable/disable accessors into a single, common helper that reconfigures every counter set in @mask, leaving the complexity of determining if an event is actually enabled in kvm_pmu_counter_is_enabled().

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175513.3658056-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
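A hedged sketch of the unified helper; the signature follows the description, the internals are illustrative:

  void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 mask)
  {
      int i;

      /* Reconfigure every counter in @mask; whether its event actually
       * counts is decided later in kvm_pmu_counter_is_enabled(). */
      for_each_set_bit(i, (unsigned long *)&mask, ARMV8_PMU_MAX_COUNTERS)
          kvm_pmu_reprogram_counter(vcpu, i);   /* hypothetical helper */
  }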
2024-12-18  KVM: arm64: Always check the state from hyp_ack_unshare()  (Quentin Perret)

There are multiple pKVM memory transitions where the state of a page is not cross-checked from the completer's PoV for performance reasons. For example, if a page is PKVM_PAGE_OWNED from the initiator's PoV, we should be guaranteed by construction that it is PKVM_NOPAGE for everybody else, hence allowing us to save a page-table lookup.

When it was introduced, hyp_ack_unshare() followed that logic and bailed out without checking the PKVM_PAGE_SHARED_BORROWED state in the hypervisor's stage-1. This was correct as we could safely assume that all host-initiated shares were directed at the hypervisor at the time. But with the introduction of other types of shares (e.g. for FF-A or non-protected guests), it is now very much required to cross-check this state to prevent the host from running __pkvm_host_unshare_hyp() on a page shared with TZ or a non-protected guest.

Thankfully, if an attacker were to try this, the hyp_unmap() call from hyp_complete_unshare() would fail, hence causing a WARN() from __do_unshare() with the host lock held, which is fatal. But this is fragile at best, and can hardly be considered a security measure.

Let's just do the right thing and always check the state from hyp_ack_unshare().

Signed-off-by: Quentin Perret <qperret@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241128154406.602875-1-qperret@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
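A hedged sketch of the fixed check; the state-checking helper follows pKVM naming conventions but is an assumption:

  static int hyp_ack_unshare(u64 addr, const struct pkvm_mem_transition *tx)
  {
      u64 size = tx->nr_pages * PAGE_SIZE;

      /* Always verify the hypervisor really borrowed this page from
       * the host before acking the unshare. */
      return __hyp_check_page_state_range(addr, size,
                                          PKVM_PAGE_SHARED_BORROWED);
  }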
2024-12-18  Merge tag 'hyperv-fixes-signed-20241217' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux  (Linus Torvalds)

Pull hyperv fixes from Wei Liu:

- Various fixes to Hyper-V tools in the kernel tree (Dexuan Cui, Olaf Hering, Vitaly Kuznetsov)

- Fix a bug in the Hyper-V TSC page based sched_clock() (Naman Jain)

- Two bug fixes in the Hyper-V utility functions (Michael Kelley)

- Convert open-coded timeouts to secs_to_jiffies() in Hyper-V drivers (Easwar Hariharan)

* tag 'hyperv-fixes-signed-20241217' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  tools/hv: reduce resource usage in hv_kvp_daemon
  tools/hv: add a .gitignore file
  tools/hv: reduce resouce usage in hv_get_dns_info helper
  hv/hv_kvp_daemon: Pass NIC name to hv_get_dns_info as well
  Drivers: hv: util: Avoid accessing a ringbuffer not initialized yet
  Drivers: hv: util: Don't force error code to ENODEV in util_probe()
  tools/hv: terminate fcopy daemon if read from uio fails
  drivers: hv: Convert open-coded timeouts to secs_to_jiffies()
  tools: hv: change permissions of NetworkManager configuration file
  x86/hyperv: Fix hv tsc page based sched_clock for hibernation
  tools: hv: Fix a complier warning in the fcopy uio daemon
2024-12-18  Merge branch 'pci-device-recovery' into features  (Alexander Gordeev)

Niklas Schnelle says:

===================
This patch series enhances the introspectability of PCI device recovery for firmware. Until now, when Linux performed recovery in response to a firmware error report, firmware debug data would have no indication whether the recovery was successful or whether it failed, for example due to KVM pass-through. Improve on this by reporting recovery status as well as some debug information such as device driver name and s390dbf/pci_msg/sprintf logs via the SCLP Write Event Data Action Qualifier 2 (Log Data provided) mechanism.
===================

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-12-18  x86/cpu: Make all CPUID leaf names consistent  (Dave Hansen)

The leaf names are not consistent. Give them all a CPUID_LEAF_ prefix for consistency and vertical alignment.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dave Jiang <dave.jiang@intel.com> # for ioatdma bits
Link: https://lore.kernel.org/all/20241213205040.7B0C3241%40davehans-spike.ostc.intel.com
2024-12-18  x86/fpu: Remove unnecessary CPUID level check  (Dave Hansen)

The CPUID level dependency table will entirely zap X86_FEATURE_XSAVE if the CPUID level is too low. This code is unreachable. Kill it.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>
Link: https://lore.kernel.org/all/20241213205038.6E71F9A4%40davehans-spike.ostc.intel.com
2024-12-18  x86/fpu: Move CPUID leaf definitions to common code  (Dave Hansen)

Move the XSAVE-related CPUID leaf definitions to common code. Then, use the new definition to remove the last magic number from the CPUID level dependency table.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205037.43C57CDE%40davehans-spike.ostc.intel.com
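A minimal sketch of the substitution; leaf 0xd is the architectural XSAVE leaf, while the define's exact name is an assumption:

  #define CPUID_LEAF_XSTATE   0x0000000d

  /* CPUID level dependency table: XSAVE needs leaf 0xd to exist. */
  { X86_FEATURE_XSAVE,  CPUID_LEAF_XSTATE },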
2024-12-18  x86/tsc: Remove CPUID "frequency" leaf magic numbers  (Dave Hansen)

All the code that reads the CPUID frequency information leaf hard-codes a magic number. Give it a symbolic name and use it.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205036.4397658F%40davehans-spike.ostc.intel.com
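A hedged sketch; leaf 0x16 is the processor-frequency information leaf, while the symbolic name is an assumption:

  #define CPUID_LEAF_FREQ 0x16    /* Processor Frequency Information */

  unsigned int base_mhz, max_mhz, bus_mhz, edx;

  if (boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ)
      cpuid(CPUID_LEAF_FREQ, &base_mhz, &max_mhz, &bus_mhz, &edx);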
2024-12-18  x86/tsc: Move away from TSC leaf magic numbers  (Dave Hansen)

The TSC code has a bunch of hard-coded references to leaf 0x15. Change them over to the symbolic name.

Also zap the 'ART_CPUID_LEAF' definition. It was a duplicate of 'CPUID_TSC_LEAF'.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213205034.B79D6224%40davehans-spike.ostc.intel.com
2024-12-18  x86/cpu: Move TSC CPUID leaf definition  (Dave Hansen)

Prepare to use the TSC CPUID leaf definition more widely by moving it to the common header.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205033.68799E53%40davehans-spike.ostc.intel.com
2024-12-18  x86/cpu: Refresh DCA leaf reading code  (Dave Hansen)

The DCA leaf number is also hard-coded in the CPUID level dependency table. Move its definition to common code and use it.

While at it, fix up the naming and types in the probe code. All CPUID data is provided in 32-bit registers, not 'unsigned long'. Also stop referring to "level_9". Move away from test_bit() because the type is no longer an 'unsigned long'.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205032.476A30FE%40davehans-spike.ostc.intel.com
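A hedged sketch of the refreshed probe; leaf 9 is the DCA capability leaf, while the define's name and the helper's shape are assumptions:

  #define CPUID_LEAF_DCA  0x9

  static bool dca_supported(void)
  {
      u32 cap = cpuid_eax(CPUID_LEAF_DCA);  /* 32-bit, not unsigned long */

      return cap & BIT(0);  /* plain mask instead of test_bit() */
  }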
2024-12-18  x86/cpu: Remove unnecessary MWAIT leaf checks  (Dave Hansen)

The CPUID leaf dependency checker will remove X86_FEATURE_MWAIT if the CPUID level is below the required level (CPUID_MWAIT_LEAF). Thus, if you check X86_FEATURE_MWAIT you do not need to also check the CPUID level.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213205030.9B42B458%40davehans-spike.ostc.intel.com
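A minimal before/after sketch of the redundancy being removed (surrounding code illustrative):

  /* Before: feature check plus a redundant max-leaf check. */
  if (!boot_cpu_has(X86_FEATURE_MWAIT) ||
      boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
      return;

  /* After: the dependency table already clears X86_FEATURE_MWAIT when
   * the max leaf is below CPUID_MWAIT_LEAF, so this alone suffices. */
  if (!boot_cpu_has(X86_FEATURE_MWAIT))
      return;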
2024-12-18  x86/cpu: Use MWAIT leaf definition  (Dave Hansen)

The leaf-to-feature dependency array uses hard-coded leaf numbers. Use the new common header definition for the MWAIT leaf.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205029.5B055D6E%40davehans-spike.ostc.intel.com
2024-12-18  x86/cpu: Move MWAIT leaf definition to common header  (Dave Hansen)

Begin constructing a common place to keep all CPUID leaf definitions. Move CPUID_MWAIT_LEAF to the CPUID header and include it where needed.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205028.EE94D02A%40davehans-spike.ostc.intel.com
2024-12-18  x86/cpu: Remove 'x86_cpu_desc' infrastructure  (Dave Hansen)

All the users of 'x86_cpu_desc' are gone. Zap it from the tree.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185133.AF0BF2BC%40davehans-spike.ostc.intel.com
2024-12-18  x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id'  (Dave Hansen)

The AMD erratum 1386 detection code uses an old-style 'x86_cpu_desc' table. Replace it with 'x86_cpu_id' so the old style can be removed.

I did not create a new helper macro here. The new table is certainly more noisy than the old and it can be improved on. But I was hesitant to create a new macro just for a single site that is only two ugly lines in the end.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185132.07555E1D%40davehans-spike.ostc.intel.com
2024-12-18  powerpc/pseries/vas: Add close() callback in vas_vm_ops struct  (Haren Myneni)

The mapping VMA address is saved in the VAS window struct when the paste address is mapped. This VMA address is used during migration to unmap the paste address if the window is active. The paste address mapping will be removed when the window is closed or with munmap(). But the VMA address in the VAS window is not updated with munmap(), which causes invalid access during migration.

The KASAN report shows:

[16386.254991] BUG: KASAN: slab-use-after-free in reconfig_close_windows+0x1a0/0x4e8
[16386.255043] Read of size 8 at addr c00000014a819670 by task drmgr/696928
[16386.255096] CPU: 29 UID: 0 PID: 696928 Comm: drmgr Kdump: loaded Tainted: G B 6.11.0-rc5-nxgzip #2
[16386.255128] Tainted: [B]=BAD_PAGE
[16386.255148] Hardware name: IBM,9080-HEX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.00 (NH1110_016) hv:phyp pSeries
[16386.255181] Call Trace:
[16386.255202] [c00000016b297660] [c0000000018ad0ac] dump_stack_lvl+0x84/0xe8 (unreliable)
[16386.255246] [c00000016b297690] [c0000000006e8a90] print_report+0x19c/0x764
[16386.255285] [c00000016b297760] [c0000000006e9490] kasan_report+0x128/0x1f8
[16386.255309] [c00000016b297880] [c0000000006eb5c8] __asan_load8+0xac/0xe0
[16386.255326] [c00000016b2978a0] [c00000000013f898] reconfig_close_windows+0x1a0/0x4e8
[16386.255343] [c00000016b297990] [c000000000140e58] vas_migration_handler+0x3a4/0x3fc
[16386.255368] [c00000016b297a90] [c000000000128848] pseries_migrate_partition+0x4c/0x4c4
...
[16386.256136] Allocated by task 696554 on cpu 31 at 16377.277618s:
[16386.256149]  kasan_save_stack+0x34/0x68
[16386.256163]  kasan_save_track+0x34/0x80
[16386.256175]  kasan_save_alloc_info+0x58/0x74
[16386.256196]  __kasan_slab_alloc+0xb8/0xdc
[16386.256209]  kmem_cache_alloc_noprof+0x200/0x3d0
[16386.256225]  vm_area_alloc+0x44/0x150
[16386.256245]  mmap_region+0x214/0x10c4
[16386.256265]  do_mmap+0x5fc/0x750
[16386.256277]  vm_mmap_pgoff+0x14c/0x24c
[16386.256292]  ksys_mmap_pgoff+0x20c/0x348
[16386.256303]  sys_mmap+0xd0/0x160
...
[16386.256350] Freed by task 0 on cpu 31 at 16386.204848s:
[16386.256363]  kasan_save_stack+0x34/0x68
[16386.256374]  kasan_save_track+0x34/0x80
[16386.256384]  kasan_save_free_info+0x64/0x10c
[16386.256396]  __kasan_slab_free+0x120/0x204
[16386.256415]  kmem_cache_free+0x128/0x450
[16386.256428]  vm_area_free_rcu_cb+0xa8/0xd8
[16386.256441]  rcu_do_batch+0x2c8/0xcf0
[16386.256458]  rcu_core+0x378/0x3c4
[16386.256473]  handle_softirqs+0x20c/0x60c
[16386.256495]  do_softirq_own_stack+0x6c/0x88
[16386.256509]  do_softirq_own_stack+0x58/0x88
[16386.256521]  __irq_exit_rcu+0x1a4/0x20c
[16386.256533]  irq_exit+0x20/0x38
[16386.256544]  interrupt_async_exit_prepare.constprop.0+0x18/0x2c
...
[16386.256717] Last potentially related work creation:
[16386.256729]  kasan_save_stack+0x34/0x68
[16386.256741]  __kasan_record_aux_stack+0xcc/0x12c
[16386.256753]  __call_rcu_common.constprop.0+0x94/0xd04
[16386.256766]  vm_area_free+0x28/0x3c
[16386.256778]  remove_vma+0xf4/0x114
[16386.256797]  do_vmi_align_munmap.constprop.0+0x684/0x870
[16386.256811]  __vm_munmap+0xe0/0x1f8
[16386.256821]  sys_munmap+0x54/0x6c
[16386.256830]  system_call_exception+0x1a0/0x4a0
[16386.256841]  system_call_vectored_common+0x15c/0x2ec
[16386.256868] The buggy address belongs to the object at c00000014a819670 which belongs to the cache vm_area_struct of size 168
[16386.256887] The buggy address is located 0 bytes inside of freed 168-byte region [c00000014a819670, c00000014a819718)
[16386.256915] The buggy address belongs to the physical page:
[16386.256928] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14a81
[16386.256950] memcg:c0000000ba430001
[16386.256961] anon flags: 0x43ffff800000000(node=4|zone=0|lastcpupid=0x7ffff)
[16386.256975] page_type: 0xfdffffff(slab)
[16386.256990] raw: 043ffff800000000 c00000000501c080 0000000000000000 5deadbee00000001
[16386.257003] raw: 0000000000000000 00000000011a011a 00000001fdffffff c0000000ba430001
[16386.257018] page dumped because: kasan: bad access detected

This patch adds a close() callback in the vas_vm_ops vm_operations_struct which will be executed during munmap() before freeing the VMA. The VMA address in the VAS window is set to NULL after holding the window mmap_mutex.

Fixes: 37e6764895ef ("powerpc/pseries/vas: Add VAS migration handler")
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20241214051758.997759-1-haren@linux.ibm.com
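A hedged sketch of the fix; the struct and field names follow the description above, but the exact layout is an assumption:

  /*
   * Invoked by the VM during munmap(), before the VMA is freed, so the
   * migration path never sees a dangling VMA pointer.
   */
  static void vas_mmap_close(struct vm_area_struct *vma)
  {
      struct vas_window *win = vma->vm_file->private_data;  /* illustrative */

      mutex_lock(&win->task_ref.mmap_mutex);
      win->task_ref.vma = NULL;
      mutex_unlock(&win->task_ref.mmap_mutex);
  }

  static const struct vm_operations_struct vas_vm_ops = {
      .close = vas_mmap_close,
      .fault = vas_mmap_fault,
  };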
2024-12-18  powerpc/book3s64/hugetlb: Fix disabling hugetlb when fadump is active  (Sourabh Jain)

Commit 8597538712eb ("powerpc/fadump: Do not use hugepages when fadump is active") disabled hugetlb support when fadump is active by returning early from hugetlbpage_init():arch/powerpc/mm/hugetlbpage.c and not populating hpage_shift/HPAGE_SHIFT.

Later, commit 2354ad252b66 ("powerpc/mm: Update default hugetlb size early") moved the allocation of hpage_shift/HPAGE_SHIFT to early boot, which inadvertently re-enabled hugetlb support when fadump is active.

Fix this by implementing hugepages_supported() on powerpc. This ensures that disabling hugetlb for the fadump kernel is independent of hpage_shift/HPAGE_SHIFT.

Fixes: 2354ad252b66 ("powerpc/mm: Update default hugetlb size early")
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20241217074640.1064510-1-sourabhjain@linux.ibm.com
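A hedged sketch of the new arch hook; is_fadump_active() is a real powerpc predicate, while the exact condition is an assumption:

  /* Hugetlb support no longer hinges on HPAGE_SHIFT being left unset. */
  static inline bool hugepages_supported(void)
  {
      if (is_fadump_active())
          return false;

      return HPAGE_SHIFT != 0;
  }
  #define hugepages_supported hugepages_supported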
2024-12-18  powerpc/vdso: Mark the vDSO code read-only after init  (Christophe Leroy)

VDSO text is fixed-up during init so it can't be const, but it can be read-only after init.

Do the same as x86 in commit 018ef8dcf3de ("x86/vdso: Mark the vDSO code read-only after init") and arm in commit 11bf9b865898 ("ARM/vdso: Mark the vDSO code read-only after init"), move it into ro_after_init section.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/e9892d288b646cbdfeef0b2b73edbaf6d3c6cabe.1734174500.git.christophe.leroy@csgroup.eu
2024-12-18  powerpc/64: Use get_user() in start_thread()  (Michael Ellerman)

For ELFv1 binaries (big endian), the ELF entry point isn't the address of the first instruction, instead it points to the function descriptor for the entry point. The address of the first instruction is in the function descriptor. That means the kernel has to fetch the address of the first instruction from user memory.

Because start_thread() uses __get_user(), which has no access_ok() checks, it looks like a malicious ELF binary could be crafted to point the entry point address at kernel memory. The kernel would load 8 bytes from kernel memory into the NIP and then start the process, it would typically crash, but a debugger could observe the NIP value which would be the result of reading from kernel memory.

However that's NOT possible, because there is a check in load_elf_binary() that ensures the ELF entry point is < TASK_SIZE (look for BAD_ADDR(elf_entry)). However it's fragile for start_thread() to rely on a check elsewhere, even if the ELF parser is unlikely to ever drop the check that elf_entry is a user address.

Make it more robust by using get_user(), which checks that the address points at userspace before doing the load. If the address doesn't point at userspace it will just set the result to zero, and the userspace program will crash at zero (which is fine because it's self-inflicted).

Note that it's also possible for a malicious binary to have a valid ELF entry address, but with the first instruction address pointing into the kernel. However that's OK, because it is blocked by the MMU, just like any other attempt to jump into the kernel from userspace.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20241216121706.26790-1-mpe@ellerman.id.au
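A minimal sketch of the hardened read; per the message, get_user() performs the access_ok() check and zeroes the destination on failure (the surrounding condition is illustrative):

  unsigned long entry = start;

  if (is_elfv1_binary) {  /* illustrative condition */
      /*
       * The entry point is a function descriptor; fetch the first
       * instruction's address with a checked user access. A kernel
       * address fails access_ok() and leaves 'entry' zeroed, so the
       * process faults at NULL instead of leaking kernel data.
       */
      if (get_user(entry, (unsigned long __user *)start))
          entry = 0;
  }
  regs->nip = entry;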
2024-12-17  x86/cpu: Replace PEBS use of 'x86_cpu_desc' use with 'x86_cpu_id'  (Dave Hansen)

The 'x86_cpu_desc' and 'x86_cpu_id' structures are very similar. Reduce duplicate infrastructure by moving the few users of 'x86_cpu_desc' to the much more common variant.

The existing X86_MATCH_VFM_STEPS() helper matches ranges of steppings. Instead of introducing a single-stepping match function which could get confusing when paired with the range, just use the stepping min/max match helper and use min==max.

Note that this makes the table more vertically compact because multiple entries like this:

  INTEL_CPU_DESC(INTEL_SKYLAKE_X, 4, 0x00000000),
  INTEL_CPU_DESC(INTEL_SKYLAKE_X, 5, 0x00000000),
  INTEL_CPU_DESC(INTEL_SKYLAKE_X, 6, 0x00000000),
  INTEL_CPU_DESC(INTEL_SKYLAKE_X, 7, 0x00000000),

can be consolidated down to a single stepping range.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185131.8B610039%40davehans-spike.ostc.intel.com
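A hedged sketch of the consolidated form; the argument order of the range helper is an assumption:

  /* Steppings 4 through 7, one entry. */
  X86_MATCH_VFM_STEPS(INTEL_SKYLAKE_X, 4, 7, 0x00000000),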
2024-12-17  x86/cpu: Expose only stepping min/max interface  (Dave Hansen)

The x86_match_cpu() infrastructure can match CPU steppings. Since there are only 16 possible steppings, the matching infrastructure goes all out and stores the stepping match as a bitmap. That means it can match any possible steppings in a single list entry. Fun.

But it exposes this bitmap to each of the X86_MATCH_*() helpers when none of them really need a bitmap. It makes up for this by exporting a helper (X86_STEPPINGS()) which converts a contiguous stepping range into the bitmap which every single user leverages.

Instead of a bitmap, have the main helper for this sort of thing (X86_MATCH_VFM_STEPS()) just take a stepping range. This ends up actually being even more compact than before.

Leave the helper in place (renamed to __X86_STEPPINGS()) to make it more clear what is going on instead of just having a random GENMASK() in the middle of an already complicated macro.

One oddity that I hit was this macro:

  X86_MATCH_VFM_STEPS(vfm, X86_STEPPING_MIN, max_stepping, issues)

It *could* have been converted over to take a min/max stepping value for each entry. But that would have been a bit too verbose and would prevent the one oddball in the list (INTEL_COMETLAKE_L stepping 0) from sticking out. Instead, just have it take a *maximum* stepping and imply that the match is from 0=>max_stepping. This is functional for all the cases now and also retains the nice property of having INTEL_COMETLAKE_L stepping 0 stick out like a sore thumb.

skx_cpuids[] is goofy. It uses the stepping match but encodes all possible steppings. Just use a normal, non-stepping match helper.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185129.65527B2A%40davehans-spike.ostc.intel.com
2024-12-17  x86/cpu: Introduce new microcode matching helper  (Dave Hansen)

The 'x86_cpu_id' and 'x86_cpu_desc' structures are very similar and need to be consolidated. There is a microcode version matching function for 'x86_cpu_desc' but not 'x86_cpu_id'.

Create one for 'x86_cpu_id'.

This essentially just leverages the x86_cpu_id->driver_data field to replace the less generic x86_cpu_desc->x86_microcode_rev field.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185128.8F24EEFC%40davehans-spike.ostc.intel.com
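A hedged sketch of the new matcher in use; the helper's exact name, the table name, and the revision value are assumptions:

  static const struct x86_cpu_id my_ucode_table[] = {
      /* driver_data now carries the minimum acceptable microcode rev. */
      X86_MATCH_VFM(INTEL_SKYLAKE_X, 0x00000037),
      {}
  };

  if (!x86_match_min_microcode_rev(my_ucode_table))
      return -ENODEV;   /* running microcode is too old */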
2024-12-17  hexagon: Disable constant extender optimization for LLVM prior to 19.1.0  (Nathan Chancellor)

The Hexagon-specific constant extender optimization in LLVM may crash on Linux kernel code [1], such as fs/bcachefs/btree_io.c after commit 32ed4a620c54 ("bcachefs: Btree path tracepoints") in 6.12:

  clang: llvm/lib/Target/Hexagon/HexagonConstExtenders.cpp:745: bool (anonymous namespace)::HexagonConstExtenders::ExtRoot::operator<(const HCE::ExtRoot &) const: Assertion `ThisB->getParent() == OtherB->getParent()' failed.
  Stack dump:
  0. Program arguments: clang --target=hexagon-linux-musl ... fs/bcachefs/btree_io.c
  1. <eof> parser at end of file
  2. Code generation
  3. Running pass 'Function Pass Manager' on module 'fs/bcachefs/btree_io.c'.
  4. Running pass 'Hexagon constant-extender optimization' on function '@__btree_node_lock_nopath'

Without assertions enabled, there is just a hang during compilation.

This has been resolved in LLVM main (20.0.0) [2] and backported to LLVM 19.1.0 [3], but the kernel supports LLVM 13.0.1 and newer, so disable the constant extender optimization using the '-mllvm' option when using a toolchain that is not fixed.

Cc: stable@vger.kernel.org
Link: https://github.com/llvm/llvm-project/issues/99714 [1]
Link: https://github.com/llvm/llvm-project/commit/68df06a0b2998765cb0a41353fcf0919bbf57ddb [2]
Link: https://github.com/llvm/llvm-project/commit/2ab8d93061581edad3501561722ebd5632d73892 [3]
Reviewed-by: Brian Cain <bcain@quicinc.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-17  arm64: dts: bcm4908: nvmem-layout conversion  (Rosen Penev)

nvmem-layout is a more flexible replacement for nvmem-cells.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://lore.kernel.org/r/20241203233632.184861-1-rosenp@gmail.com
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
2024-12-17  arm64: dts: broadcom: bcmbca: bcm4908: Add DT for Zyxel EX3510-B  (Sam Edwards)

Zyxel EX3510-B is a WiFi 6 capable home gateway (family) based on the BCM4906 SoC, with 512MiB of RAM and 512MiB of NAND flash. WiFi support consists of a BCM6710 and a BCM6715 attached to separate PCIe buses.

Add an initial devicetree for this system, with support for:
- Onboard UART (per base dtsi)
- USB (2.0 only; superspeed devices are treated as high-speed due to an unknown cause)
- Both buttons (rear reset, front WPS)
- Almost all LEDs:
  - Power (red/green)
  - Internet (red/green)
  - WAN (green)
  - LAN (green; anode is connected to GPIO 13 so currently nonfunctioning)
  - USB (green)
  - WPS button (red/green)
  - Absent in DT: There are 2.4GHz/5.0GHz WiFi status LEDs connected to the WiFi chips instead of the SoC.
- NAND flash
- Embedded Ethernet switch
- Factory-programmed Ethernet MAC address

WiFi cannot be enabled at this time due to Linux lacking drivers for both the PCIe controllers and the PCIe WiFi peripherals.

Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Link: https://lore.kernel.org/r/20241009215454.1449508-3-CFSworks@gmail.com
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
2024-12-17  arm64: dts: broadcom: bcmbca: bcm4908: Protect cpu-release-addr  (Sam Edwards)

The `cpu-release-addr` property is relevant only when the "spin-table" enable method is used. It is the physical address where the bootloader expects Linux to write the secondary CPU entry point's physical address. On this platform, only the CFE bootloader uses this method: U-Boot uses PSCI instead.

CFE actually walks the FDT to learn this address, so we're free to put it wherever we want. We only need to make sure that it goes in a reserved-memory block so that writing to it during early boot does not risk conflicting with an unrelated memory allocation: this was not done.

Since the previous patch reserved the first page of memory for CFE's secondary-CPU init stub, which is actually much smaller than a page, just put this address at the end of that page and it shall be so protected.

Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Link: https://lore.kernel.org/r/20241005050155.61103-3-CFSworks@gmail.com
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
2024-12-17  arm64: dts: broadcom: bcmbca: bcm4908: Reserve CFE stub area  (Sam Edwards)

The CFE bootloader places a stub program in the first page of physical memory to hold the secondary CPUs until the boot CPU writes the release address, but does not splice a /reserved-memory node into the FDT to protect it. If Linux overwrites this program before execution reaches smp_prepare_cpus(), the secondary CPUs may become inaccessible.

This is only a problem with CFE, and then only until the secondary CPUs are brought online. Ideally, there would be some hypothetical mechanism we could use to indicate that this area of memory is sensitive only during boot. But as there is none, and since it is such a small amount of memory, it is easiest to reserve it unconditionally.

Therefore, add a /reserved-memory node to bcm4908.dtsi to protect the first 4KiB of physical memory.

Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Link: https://lore.kernel.org/r/20241005050155.61103-2-CFSworks@gmail.com
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
2024-12-17  arm64: dts: broadcom: Remove unused and undocumented properties  (Rob Herring (Arm))

Remove properties which are both unused in the kernel and undocumented. Most likely they are leftovers from downstream.

Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/r/20241115193854.3624123-1-robh@kernel.org
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
2024-12-17  arm64: dts: broadcom: Add DT for D-step version of BCM2712  (Dave Stevenson)

The D-Step has some minor variations in the hardware, so needs matching changes to DT.

Add a new DTS file that modifies the existing (C-step) devicetree.

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.com>
Link: https://lore.kernel.org/r/20241025-drm-vc4-2712-support-v2-36-35efa83c8fc0@raspberrypi.com
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
2024-12-17  arm64: dts: broadcom: Add display pipeline support to BCM2712  (Dave Stevenson)

Adds the HVS and associated hardware blocks to support the HDMI and writeback connectors on BCM2712 / Pi5.

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.com>
Link: https://lore.kernel.org/r/20241025-drm-vc4-2712-support-v2-35-35efa83c8fc0@raspberrypi.com
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
2024-12-17  arm64: dts: broadcom: Add firmware clocks and power nodes to Pi5 DT  (Dave Stevenson)

BCM2712 still uses the firmware clocks and power drivers, so add them to the base device tree.

The brcm,bcm2836-l1-intc controller isn't used on this platform. It is used on 32-bit kernels for the smp_boot_secondary hook, but BCM2712 can't run a 32-bit kernel.

Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.com>
Link: https://lore.kernel.org/r/20241025-drm-vc4-2712-support-v2-34-35efa83c8fc0@raspberrypi.com
Link: https://lore.kernel.org/r/20241212-dt-bcm2712-fixes-v3-7-44a7f3390331@raspberrypi.com
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>