Age | Commit message (Collapse) | Author |
|
Refactor kvm_set_cpu_caps() to express each supported (or not) feature
flag on a separate line, modulo a handful of cases where KVM does not, and
likely will not, support a sequence of flags. This will allow adding
fancier macros with longer, more descriptive names without resulting in
absurd line lengths and/or weird code. Isolating each flag also makes it
far easier to review changes, reduces code conflicts, and generally makes
it easier to resolve conflicts. Lastly, it allows co-locating comments
for notable flags, e.g. MONITOR, precisely with the relevant flag.
No functional change intended.
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Explicitly zero out the feature word in kvm_cpu_caps if the word's
associated CPUID function is greater than the max leaf supported by the
CPU. For such unsupported functions, Intel CPUs return the output from
the last supported leaf, not all zeros.
Practically speaking, this is likely a benign bug, as KVM uses the raw
host CPUID to mask the kernel's computed capabilities, and the kernel does
perform max leaf checks when populating boot_cpu_data. The only way KVM's
goof could be problematic is if the kernel force-set a feature in a leaf
that is completely unsupported, _and_ the max supported leaf happened to
return a value with '1' the same bit position. Which is theoretically
possible, but extremely unlikely. And even if that did happen, it's
entirely possible that KVM would still provide the correct functionality;
the kernel did set the capability after all.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Do the compile-time sanity checks on reverse_cpuid in __feature_leaf() so
that higher level APIs don't need to "manually" perform the sanity checks.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Revert the chunk of commit 01b4f510b9f4 ("kvm: x86: ensure pv_cpuid.features
is initialized when enabling cap") that forced a PV features cache refresh
during KVM_CAP_ENFORCE_PV_FEATURE_CPUID, as whatever ioctl() ordering
issue it alleged to have fixed never existed upstream, and likely never
existed in any kernel.
At the time of the commit, there was a tangentially related ioctl()
ordering issue, as toggling KVM_X86_DISABLE_EXITS_HLT after KVM_SET_CPUID2
would have resulted in KVM potentially leaving KVM_FEATURE_PV_UNHALT set.
But (a) that bug affected the entire guest CPUID, not just the cache, (b)
commit 01b4f510b9f4 didn't address that bug, it only refreshed the cache
(with the bad CPUID), and (c) setting KVM_X86_DISABLE_EXITS_HLT after vCPU
creation is completely broken as KVM configures HLT-exiting only during
vCPU creation, which is why KVM_CAP_X86_DISABLE_EXITS is now disallowed if
vCPUs have been created.
Another tangentially related bug was KVM's failure to clear the cache when
handling KVM_SET_CPUID2, but again commit 01b4f510b9f4 did nothing to fix
that bug.
The most plausible explanation for the what commit 01b4f510b9f4 was trying
to fix is a bug that existed in Google's internal kernel that was the
source of commit 01b4f510b9f4. At the time, Google's internal kernel had
not yet picked up commit 0d3b2ba16ba68 ("KVM: X86: Go on updating other
CPUID leaves when leaf 1 is absent"), i.e. KVM would not initialize the
PV features cache if KVM_SET_CPUID2 was called without a CPUID.0x1 entry.
Of course, no sane real world VMM would omit CPUID.0x1, including the KVM
selftest added by commit ac4a4d6de22e ("selftests: kvm: test enforcement
of paravirtual cpuid features"). And the test didn't actually try to
verify multiple orderings, nor did the selftest enter the guest without
doing KVM_SET_CPUID2, so who knows what motivated the change.
Regardless of why commit 01b4f510b9f4 ("kvm: x86: ensure pv_cpuid.features
is initialized when enabling cap") was added, refreshing the cache during
KVM_CAP_ENFORCE_PV_FEATURE_CPUID isn't necessary.
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Clear KVM's PV feature cache prior when processing a new guest CPUID so
that KVM doesn't keep a stale cache entry if userspace does KVM_SET_CPUID2
multiple times, once with a PV features entry, and a second time without.
Fixes: 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Reject KVM_CAP_X86_DISABLE_EXITS if userspace attempts to disable MWAIT or
HLT exits and KVM previously reported (via KVM_CHECK_EXTENSION) that
disabling the exit(s) is not allowed. E.g. because MWAIT isn't supported
or the CPU doesn't have an always-running APIC timer, or because KVM is
configured to mitigate cross-thread vulnerabilities.
Cc: Kechen Lu <kechenl@nvidia.com>
Fixes: 4d5422cea3b6 ("KVM: X86: Provide a capability to disable MWAIT intercepts")
Fixes: 6f0f2d5ef895 ("KVM: x86: Mitigate the cross-thread return address predictions bug")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Reject KVM_CAP_X86_DISABLE_EXITS if vCPUs have been created, as disabling
PAUSE/MWAIT/HLT exits after vCPUs have been created is broken and useless,
e.g. except for PAUSE on SVM, the relevant intercepts aren't updated after
vCPU creation. vCPUs may also end up with an inconsistent configuration
if exits are disabled between creation of multiple vCPUs.
Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/all/9227068821b275ac547eb2ede09ec65d2281fe07.1680179693.git.houwenlong.hwl@antgroup.com
Link: https://lore.kernel.org/all/20230121020738.2973-2-kechenl@nvidia.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the manual initialization of maxphyaddr and reserved_gpa_bits during
vCPU creation now that kvm_arch_vcpu_create() unconditionally invokes
kvm_vcpu_after_set_cpuid(), which handles all such CPUID caching.
None of the helpers between the existing code in kvm_arch_vcpu_create()
and the call to kvm_vcpu_after_set_cpuid() consume maxphyaddr or
reserved_gpa_bits (though auditing vmx_vcpu_create() and svm_vcpu_create()
isn't exactly easy).
Link: https://lore.kernel.org/r/20241128013424.4096668-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the manual kvm_pmu_refresh() from kvm_pmu_init() now that
kvm_arch_vcpu_create() performs the refresh via kvm_vcpu_after_set_cpuid().
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Let vendor code inline __kvm_is_valid_cr4() now x86.c's cr4_reserved_bits
no longer exists, as keeping cr4_reserved_bits local to x86.c was the only
reason for "hiding" the definition of __kvm_is_valid_cr4().
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop x86.c's local pre-computed cr4_reserved bits and instead fold KVM's
reserved bits into the guest's reserved bits. This fixes a bug where VMX's
set_cr4_guest_host_mask() fails to account for KVM-reserved bits when
deciding which bits can be passed through to the guest. In most cases,
letting the guest directly write reserved CR4 bits is ok, i.e. attempting
to set the bit(s) will still #GP, but not if a feature is available in
hardware but explicitly disabled by the host, e.g. if FSGSBASE support is
disabled via "nofsgsbase".
Note, the extra overhead of computing host reserved bits every time
userspace sets guest CPUID is negligible. The feature bits that are
queried are packed nicely into a handful of words, and so checking and
setting each reserved bit costs in the neighborhood of ~5 cycles, i.e. the
total cost will be in the noise even if the number of checked CR4 bits
doubles over the next few years. In other words, x86 will run out of CR4
bits long before the overhead becomes problematic.
Note #2, __cr4_reserved_bits() starts from CR4_RESERVED_BITS, which is
why the existing __kvm_cpu_cap_has() processing doesn't explicitly OR in
CR4_RESERVED_BITS (and why the new code doesn't do so either).
Fixes: 2ed41aa631fc ("KVM: VMX: Intercept guest reserved CR4 bits to inject #GP fault")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Explicitly perform runtime CPUID adjustments as part of the "after set
CPUID" flow to guard against bugs where KVM consumes stale vCPU/CPUID
state during kvm_update_cpuid_runtime(). E.g. see commit 4736d85f0d18
("KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT").
Whacking each mole individually is not sustainable or robust, e.g. while
the aforemention commit fixed KVM's PV features, the same issue lurks for
Xen and Hyper-V features, Xen and Hyper-V simply don't have any runtime
features (though spoiler alert, neither should KVM).
Updating runtime features in the "full" path will also simplify adding a
snapshot of the guest's capabilities, i.e. of caching the intersection of
guest CPUID and kvm_cpu_caps (modulo a few edge cases).
Link: https://lore.kernel.org/r/20241128013424.4096668-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
During vCPU creation, process KVM's default, empty CPUID as if userspace
set an empty CPUID to ensure consistent and correct behavior with respect
to guest CPUID. E.g. if userspace never sets guest CPUID, KVM will never
configure cr4_guest_rsvd_bits, and thus create divergent, incorrect, guest-
visible behavior due to letting the guest set any KVM-supported CR4 bits
despite the features not being allowed per guest CPUID.
Note! This changes KVM's ABI, as lack of full CPUID processing allowed
userspace to stuff garbage vCPU state, e.g. userspace could set CR4 to a
guest-unsupported value via KVM_SET_SREGS. But it's extremely unlikely
that this is a breaking change, as KVM already has many flows that require
userspace to set guest CPUID before loading vCPU state. E.g. multiple MSR
flows consult guest CPUID on host writes, and KVM_SET_SREGS itself already
relies on guest CPUID being up-to-date, as KVM's validity check on CR3
consumes CPUID.0x7.1 (for LAM) and CPUID.0x80000008 (for MAXPHYADDR).
Furthermore, the plan is to commit to enforcing guest CPUID for userspace
writes to MSRs, at which point bypassing sregs CPUID checks is even more
nonsensical.
Link: https://lore.kernel.org/r/20241128013424.4096668-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Define and undefine the F() and SF() macros precisely around
kvm_set_cpu_caps() to make it all but impossible to use the macros outside
of kvm_cpu_cap_{mask,init_kvm_defined}(). Currently, F() is a simple
passthrough, but SF() is actively dangerous as it checks that the scattered
feature is supported by the host kernel.
And usage outside of the aforementioned helpers will run afoul of future
changes to harden KVM's CPUID management.
Opportunistically switch to feature_bit() when stuffing LA57 based on raw
hardware support.
No functional change intended.
Link: https://lore.kernel.org/r/20241128013424.4096668-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When clearing CONSTANT_TSC during CPUID emulation due to a Hyper-V quirk,
use feature_bit() instead of SF() to ensure the bit is actually cleared.
SF() evaluates to zero if the _host_ doesn't support the feature. I.e.
KVM could keep the bit set if userspace advertised CONSTANT_TSC despite
it not being supported in hardware.
Note, translating from a scattered feature to a the hardware version is
done by __feature_translate(), not SF(). The sole purpose of SF() is to
check kernel support for the scattered feature, *before* translation.
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
An important distinction from other registers affected by HPMN is that
PMCR_EL0 only affects the guest range of counters, regardless of the EL
from which it is accessed. Ensure that PMCR_EL0.P is always applied to
'guest' counters by manually computing the mask rather than deriving it
from the current context.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175611.3658290-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
MDCR_EL2.HPME is the 'global' enable bit for event counters reserved for
EL2. Give the PMU a kick when it's changed to ensure events are
reprogrammed before returning to the guest.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175550.3658212-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Nested virt introduces yet another set of 'global' knobs for controlling
event counters that are reserved for EL2 (i.e. >= HPMN). Get ready to
share some plumbing with the NV controls by offloading counter
reprogramming to KVM_REQ_RELOAD_PMU.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175532.3658134-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Having separate helpers for enabling/disabling counters provides the
wrong abstraction, as the state of each counter needs to be evaluated
independently and, in some cases, use a different global enable bit.
Collapse the enable/disable accessors into a single, common helper that
reconfigures every counter set in @mask, leaving the complexity of
determining if an event is actually enabled in
kvm_pmu_counter_is_enabled().
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175513.3658056-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
There are multiple pKVM memory transitions where the state of a page is
not cross-checked from the completer's PoV for performance reasons.
For example, if a page is PKVM_PAGE_OWNED from the initiator's PoV,
we should be guaranteed by construction that it is PKVM_NOPAGE for
everybody else, hence allowing us to save a page-table lookup.
When it was introduced, hyp_ack_unshare() followed that logic and bailed
out without checking the PKVM_PAGE_SHARED_BORROWED state in the
hypervisor's stage-1. This was correct as we could safely assume that
all host-initiated shares were directed at the hypervisor at the time.
But with the introduction of other types of shares (e.g. for FF-A or
non-protected guests), it is now very much required to cross check this
state to prevent the host from running __pkvm_host_unshare_hyp() on a
page shared with TZ or a non-protected guest.
Thankfully, if an attacker were to try this, the hyp_unmap() call from
hyp_complete_unshare() would fail, hence causing to WARN() from
__do_unshare() with the host lock held, which is fatal. But this is
fragile at best, and can hardly be considered a security measure.
Let's just do the right thing and always check the state from
hyp_ack_unshare().
Signed-off-by: Quentin Perret <qperret@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241128154406.602875-1-qperret@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Various fixes to Hyper-V tools in the kernel tree (Dexuan Cui, Olaf
Hering, Vitaly Kuznetsov)
- Fix a bug in the Hyper-V TSC page based sched_clock() (Naman Jain)
- Two bug fixes in the Hyper-V utility functions (Michael Kelley)
- Convert open-coded timeouts to secs_to_jiffies() in Hyper-V drivers
(Easwar Hariharan)
* tag 'hyperv-fixes-signed-20241217' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
tools/hv: reduce resource usage in hv_kvp_daemon
tools/hv: add a .gitignore file
tools/hv: reduce resouce usage in hv_get_dns_info helper
hv/hv_kvp_daemon: Pass NIC name to hv_get_dns_info as well
Drivers: hv: util: Avoid accessing a ringbuffer not initialized yet
Drivers: hv: util: Don't force error code to ENODEV in util_probe()
tools/hv: terminate fcopy daemon if read from uio fails
drivers: hv: Convert open-coded timeouts to secs_to_jiffies()
tools: hv: change permissions of NetworkManager configuration file
x86/hyperv: Fix hv tsc page based sched_clock for hibernation
tools: hv: Fix a complier warning in the fcopy uio daemon
|
|
Niklas Schnelle says:
===================
This patch series enhances the introspectability of the PCI device
recovery for firmware. Until now when Linux performs recovery in
response to a firmware error report. For example, until now firmware
debug data would have no indication if the recovery was successfull or
if it failed, for example due to KVM pass-through.
Improve on this by reporting recovery status as well as some debug
information such as device driver name and s390dbf/pci_msg/sprintf logs
via the SCLP Write Event Data Action Qualifier 2 (Log Data provided)
mechanism.
===================
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
The leaf names are not consistent. Give them all a CPUID_LEAF_ prefix
for consistency and vertical alignment.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dave Jiang <dave.jiang@intel.com> # for ioatdma bits
Link: https://lore.kernel.org/all/20241213205040.7B0C3241%40davehans-spike.ostc.intel.com
|
|
The CPUID level dependency table will entirely zap X86_FEATURE_XSAVE
if the CPUID level is too low. This code is unreachable. Kill it.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>
Link: https://lore.kernel.org/all/20241213205038.6E71F9A4%40davehans-spike.ostc.intel.com
|
|
Move the XSAVE-related CPUID leaf definitions to common code. Then,
use the new definition to remove the last magic number from the CPUID
level dependency table.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205037.43C57CDE%40davehans-spike.ostc.intel.com
|
|
All the code that reads the CPUID frequency information leaf hard-codes
a magic number. Give it a symbolic name and use it.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205036.4397658F%40davehans-spike.ostc.intel.com
|
|
The TSC code has a bunch of hard-coded references to leaf 0x15. Change
them over to the symbolic name. Also zap the 'ART_CPUID_LEAF' definition.
It was a duplicate of 'CPUID_TSC_LEAF'.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213205034.B79D6224%40davehans-spike.ostc.intel.com
|
|
Prepare to use the TSC CPUID leaf definition more widely by moving
it to the common header.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205033.68799E53%40davehans-spike.ostc.intel.com
|
|
The DCA leaf number is also hard-coded in the CPUID level dependency
table. Move its definition to common code and use it.
While at it, fix up the naming and types in the probe code. All
CPUID data is provided in 32-bit registers, not 'unsigned long'.
Also stop referring to "level_9". Move away from test_bit()
because the type is no longer an 'unsigned long'.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205032.476A30FE%40davehans-spike.ostc.intel.com
|
|
The CPUID leaf dependency checker will remove X86_FEATURE_MWAIT if
the CPUID level is below the required level (CPUID_MWAIT_LEAF).
Thus, if you check X86_FEATURE_MWAIT you do not need to also
check the CPUID level.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213205030.9B42B458%40davehans-spike.ostc.intel.com
|
|
The leaf-to-feature dependency array uses hard-coded leaf numbers.
Use the new common header definition for the MWAIT leaf.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205029.5B055D6E%40davehans-spike.ostc.intel.com
|
|
Begin constructing a common place to keep all CPUID leaf definitions.
Move CPUID_MWAIT_LEAF to the CPUID header and include it where
needed.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205028.EE94D02A%40davehans-spike.ostc.intel.com
|
|
All the users of 'x86_cpu_desc' are gone. Zap it from the tree.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185133.AF0BF2BC%40davehans-spike.ostc.intel.com
|
|
The AMD erratum 1386 detection code uses and old style 'x86_cpu_desc'
table. Replace it with 'x86_cpu_id' so the old style can be removed.
I did not create a new helper macro here. The new table is certainly
more noisy than the old and it can be improved on. But I was hesitant
to create a new macro just for a single site that is only two ugly
lines in the end.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185132.07555E1D%40davehans-spike.ostc.intel.com
|
|
The mapping VMA address is saved in VAS window struct when the
paste address is mapped. This VMA address is used during migration
to unmap the paste address if the window is active. The paste
address mapping will be removed when the window is closed or with
the munmap(). But the VMA address in the VAS window is not updated
with munmap() which is causing invalid access during migration.
The KASAN report shows:
[16386.254991] BUG: KASAN: slab-use-after-free in reconfig_close_windows+0x1a0/0x4e8
[16386.255043] Read of size 8 at addr c00000014a819670 by task drmgr/696928
[16386.255096] CPU: 29 UID: 0 PID: 696928 Comm: drmgr Kdump: loaded Tainted: G B 6.11.0-rc5-nxgzip #2
[16386.255128] Tainted: [B]=BAD_PAGE
[16386.255148] Hardware name: IBM,9080-HEX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.00 (NH1110_016) hv:phyp pSeries
[16386.255181] Call Trace:
[16386.255202] [c00000016b297660] [c0000000018ad0ac] dump_stack_lvl+0x84/0xe8 (unreliable)
[16386.255246] [c00000016b297690] [c0000000006e8a90] print_report+0x19c/0x764
[16386.255285] [c00000016b297760] [c0000000006e9490] kasan_report+0x128/0x1f8
[16386.255309] [c00000016b297880] [c0000000006eb5c8] __asan_load8+0xac/0xe0
[16386.255326] [c00000016b2978a0] [c00000000013f898] reconfig_close_windows+0x1a0/0x4e8
[16386.255343] [c00000016b297990] [c000000000140e58] vas_migration_handler+0x3a4/0x3fc
[16386.255368] [c00000016b297a90] [c000000000128848] pseries_migrate_partition+0x4c/0x4c4
...
[16386.256136] Allocated by task 696554 on cpu 31 at 16377.277618s:
[16386.256149] kasan_save_stack+0x34/0x68
[16386.256163] kasan_save_track+0x34/0x80
[16386.256175] kasan_save_alloc_info+0x58/0x74
[16386.256196] __kasan_slab_alloc+0xb8/0xdc
[16386.256209] kmem_cache_alloc_noprof+0x200/0x3d0
[16386.256225] vm_area_alloc+0x44/0x150
[16386.256245] mmap_region+0x214/0x10c4
[16386.256265] do_mmap+0x5fc/0x750
[16386.256277] vm_mmap_pgoff+0x14c/0x24c
[16386.256292] ksys_mmap_pgoff+0x20c/0x348
[16386.256303] sys_mmap+0xd0/0x160
...
[16386.256350] Freed by task 0 on cpu 31 at 16386.204848s:
[16386.256363] kasan_save_stack+0x34/0x68
[16386.256374] kasan_save_track+0x34/0x80
[16386.256384] kasan_save_free_info+0x64/0x10c
[16386.256396] __kasan_slab_free+0x120/0x204
[16386.256415] kmem_cache_free+0x128/0x450
[16386.256428] vm_area_free_rcu_cb+0xa8/0xd8
[16386.256441] rcu_do_batch+0x2c8/0xcf0
[16386.256458] rcu_core+0x378/0x3c4
[16386.256473] handle_softirqs+0x20c/0x60c
[16386.256495] do_softirq_own_stack+0x6c/0x88
[16386.256509] do_softirq_own_stack+0x58/0x88
[16386.256521] __irq_exit_rcu+0x1a4/0x20c
[16386.256533] irq_exit+0x20/0x38
[16386.256544] interrupt_async_exit_prepare.constprop.0+0x18/0x2c
...
[16386.256717] Last potentially related work creation:
[16386.256729] kasan_save_stack+0x34/0x68
[16386.256741] __kasan_record_aux_stack+0xcc/0x12c
[16386.256753] __call_rcu_common.constprop.0+0x94/0xd04
[16386.256766] vm_area_free+0x28/0x3c
[16386.256778] remove_vma+0xf4/0x114
[16386.256797] do_vmi_align_munmap.constprop.0+0x684/0x870
[16386.256811] __vm_munmap+0xe0/0x1f8
[16386.256821] sys_munmap+0x54/0x6c
[16386.256830] system_call_exception+0x1a0/0x4a0
[16386.256841] system_call_vectored_common+0x15c/0x2ec
[16386.256868] The buggy address belongs to the object at c00000014a819670
which belongs to the cache vm_area_struct of size 168
[16386.256887] The buggy address is located 0 bytes inside of
freed 168-byte region [c00000014a819670, c00000014a819718)
[16386.256915] The buggy address belongs to the physical page:
[16386.256928] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14a81
[16386.256950] memcg:c0000000ba430001
[16386.256961] anon flags: 0x43ffff800000000(node=4|zone=0|lastcpupid=0x7ffff)
[16386.256975] page_type: 0xfdffffff(slab)
[16386.256990] raw: 043ffff800000000 c00000000501c080 0000000000000000 5deadbee00000001
[16386.257003] raw: 0000000000000000 00000000011a011a 00000001fdffffff c0000000ba430001
[16386.257018] page dumped because: kasan: bad access detected
This patch adds close() callback in vas_vm_ops vm_operations_struct
which will be executed during munmap() before freeing VMA. The VMA
address in the VAS window is set to NULL after holding the window
mmap_mutex.
Fixes: 37e6764895ef ("powerpc/pseries/vas: Add VAS migration handler")
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20241214051758.997759-1-haren@linux.ibm.com
|
|
Commit 8597538712eb ("powerpc/fadump: Do not use hugepages when fadump
is active") disabled hugetlb support when fadump is active by returning
early from hugetlbpage_init():arch/powerpc/mm/hugetlbpage.c and not
populating hpage_shift/HPAGE_SHIFT.
Later, commit 2354ad252b66 ("powerpc/mm: Update default hugetlb size
early") moved the allocation of hpage_shift/HPAGE_SHIFT to early boot,
which inadvertently re-enabled hugetlb support when fadump is active.
Fix this by implementing hugepages_supported() on powerpc. This ensures
that disabling hugetlb for the fadump kernel is independent of
hpage_shift/HPAGE_SHIFT.
Fixes: 2354ad252b66 ("powerpc/mm: Update default hugetlb size early")
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20241217074640.1064510-1-sourabhjain@linux.ibm.com
|
|
VDSO text is fixed-up during init so it can't be const,
but it can be read-only after init.
Do the same as x86 in commit 018ef8dcf3de ("x86/vdso: Mark the vDSO
code read-only after init") and arm in commit 11bf9b865898 ("ARM/vdso:
Mark the vDSO code read-only after init"), move it into
ro_after_init section.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/e9892d288b646cbdfeef0b2b73edbaf6d3c6cabe.1734174500.git.christophe.leroy@csgroup.eu
|
|
For ELFv1 binaries (big endian), the ELF entry point isn't the address
of the first instruction, instead it points to the function descriptor
for the entry point. The address of the first instruction is in the
function descriptor.
That means the kernel has to fetch the address of the first instruction
from user memory.
Because start_thread() uses __get_user(), which has no access_ok()
checks, it looks like a malicious ELF binary could be crafted to point
the entry point address at kernel memory. The kernel would load 8 bytes
from kernel memory into the NIP and then start the process, it would
typically crash, but a debugger could observe the NIP value which would
be the result of reading from kernel memory.
However that's NOT possible, because there is a check in
load_elf_binary() that ensures the ELF entry point is < TASK_SIZE
(look for BAD_ADDR(elf_entry)).
However it's fragile for start_thread() to rely on a check elsewhere,
even if the ELF parser is unlikely to ever drop the check that elf_entry
is a user address.
Make it more robust by using get_user(), which checks that the address
points at userspace before doing the load. If the address doesn't point
at userspace it will just set the result to zero, and the userspace
program will crash at zero (which is fine because it's self-inflicted).
Note that it's also possible for a malicious binary to have a valid
ELF entry address, but with the first instruction address pointing into
the kernel. However that's OK, because it is blocked by the MMU, just
like any other attempt to jump into the kernel from userspace.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20241216121706.26790-1-mpe@ellerman.id.au
|
|
The 'x86_cpu_desc' and 'x86_cpu_id' structures are very similar.
Reduce duplicate infrastructure by moving the few users of
'x86_cpu_desc' to the much more common variant.
The existing X86_MATCH_VFM_STEPS() helper matches ranges of
steppings. Instead of introducing a single-stepping match function
which could get confusing when paired with the range, just use
the stepping min/max match helper and use min==max.
Note that this makes the table more vertically compact because
multiple entries like this:
INTEL_CPU_DESC(INTEL_SKYLAKE_X, 4, 0x00000000),
INTEL_CPU_DESC(INTEL_SKYLAKE_X, 5, 0x00000000),
INTEL_CPU_DESC(INTEL_SKYLAKE_X, 6, 0x00000000),
INTEL_CPU_DESC(INTEL_SKYLAKE_X, 7, 0x00000000),
can be consolidated down to a single stepping range.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185131.8B610039%40davehans-spike.ostc.intel.com
|
|
The x86_match_cpu() infrastructure can match CPU steppings. Since
there are only 16 possible steppings, the matching infrastructure goes
all out and stores the stepping match as a bitmap. That means it can
match any possible steppings in a single list entry. Fun.
But it exposes this bitmap to each of the X86_MATCH_*() helpers when
none of them really need a bitmap. It makes up for this by exporting a
helper (X86_STEPPINGS()) which converts a contiguous stepping range
into the bitmap which every single user leverages.
Instead of a bitmap, have the main helper for this sort of thing
(X86_MATCH_VFM_STEPS()) just take a stepping range. This ends up
actually being even more compact than before.
Leave the helper in place (renamed to __X86_STEPPINGS()) to make it
more clear what is going on instead of just having a random GENMASK()
in the middle of an already complicated macro.
One oddity that I hit was this macro:
X86_MATCH_VFM_STEPS(vfm, X86_STEPPING_MIN, max_stepping, issues)
It *could* have been converted over to take a min/max stepping value
for each entry. But that would have been a bit too verbose and would
prevent the one oddball in the list (INTEL_COMETLAKE_L stepping 0)
from sticking out.
Instead, just have it take a *maximum* stepping and imply that the match
is from 0=>max_stepping. This is functional for all the cases now and
also retains the nice property of having INTEL_COMETLAKE_L stepping 0
stick out like a sore thumb.
skx_cpuids[] is goofy. It uses the stepping match but encodes all
possible steppings. Just use a normal, non-stepping match helper.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185129.65527B2A%40davehans-spike.ostc.intel.com
|
|
The 'x86_cpu_id' and 'x86_cpu_desc' structures are very similar and
need to be consolidated. There is a microcode version matching
function for 'x86_cpu_desc' but not 'x86_cpu_id'.
Create one for 'x86_cpu_id'.
This essentially just leverages the x86_cpu_id->driver_data field
to replace the less generic x86_cpu_desc->x86_microcode_rev field.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185128.8F24EEFC%40davehans-spike.ostc.intel.com
|
|
The Hexagon-specific constant extender optimization in LLVM may crash on
Linux kernel code [1], such as fs/bcache/btree_io.c after
commit 32ed4a620c54 ("bcachefs: Btree path tracepoints") in 6.12:
clang: llvm/lib/Target/Hexagon/HexagonConstExtenders.cpp:745: bool (anonymous namespace)::HexagonConstExtenders::ExtRoot::operator<(const HCE::ExtRoot &) const: Assertion `ThisB->getParent() == OtherB->getParent()' failed.
Stack dump:
0. Program arguments: clang --target=hexagon-linux-musl ... fs/bcachefs/btree_io.c
1. <eof> parser at end of file
2. Code generation
3. Running pass 'Function Pass Manager' on module 'fs/bcachefs/btree_io.c'.
4. Running pass 'Hexagon constant-extender optimization' on function '@__btree_node_lock_nopath'
Without assertions enabled, there is just a hang during compilation.
This has been resolved in LLVM main (20.0.0) [2] and backported to LLVM
19.1.0 but the kernel supports LLVM 13.0.1 and newer, so disable the
constant expander optimization using the '-mllvm' option when using a
toolchain that is not fixed.
Cc: stable@vger.kernel.org
Link: https://github.com/llvm/llvm-project/issues/99714 [1]
Link: https://github.com/llvm/llvm-project/commit/68df06a0b2998765cb0a41353fcf0919bbf57ddb [2]
Link: https://github.com/llvm/llvm-project/commit/2ab8d93061581edad3501561722ebd5632d73892 [3]
Reviewed-by: Brian Cain <bcain@quicinc.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
nvmem-layout is a more flexible replacement for nvmem-cells.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://lore.kernel.org/r/20241203233632.184861-1-rosenp@gmail.com
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
|
|
Zyxel EX3510-B is a WiFi 6 capable home gateway (family) based on the
BCM4906 SoC, with 512MiB of RAM and 512MiB of NAND flash. WiFi support
consists of a BCM6710 and a BCM6715 attached to separate PCIe buses.
Add an initial devicetree for this system, with support for:
- Onboard UART (per base dtsi)
- USB (2.0 only; superspeed devices are treated as high-speed due to an
unknown cause)
- Both buttons (rear reset, front WPS)
- Almost all LEDs:
- Power (red/green)
- Internet (red/green)
- WAN (green)
- LAN (green; anode is connected to GPIO 13 so currently
nonfunctioning)
- USB (green)
- WPS button (red/green)
- Absent in DT: There are 2.4GHz/5.0GHz WiFi status LEDs connected to
the WiFi chips instead of the SoC.
- NAND flash
- Embedded Ethernet switch
- Factory-programmed Ethernet MAC address
WiFi cannot be enabled at this time due to Linux lacking drivers for
both the PCIe controllers and the PCIe WiFi peripherals.
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Link: https://lore.kernel.org/r/20241009215454.1449508-3-CFSworks@gmail.com
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
|
|
The `cpu-release-addr` property is relevant only when the "spin-table"
enable method is used. It is the physical address where the bootloader
expects Linux to write the secondary CPU entry point's physical address.
On this platform, only the CFE bootloader uses this method: U-Boot uses
PSCI instead.
CFE actually walks the FDT to learn this address, so we're free to put
it wherever we want. We only need to make sure that it goes in a
reserved-memory block so that writing to it during early boot does not
risk conflicting with an unrelated memory allocation: this was not done.
Since the previous patch reserved the first page of memory for CFE's
secondary-CPU init stub, which is actually much smaller than a page,
just put this address at the end of that page and it shall be so
protected.
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Link: https://lore.kernel.org/r/20241005050155.61103-3-CFSworks@gmail.com
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
|
|
The CFE bootloader places a stub program in the first page of physical
memory to hold the secondary CPUs until the boot CPU writes the release
address, but does not splice a /reserved-memory node into the FDT to
protect it. If Linux overwrites this program before execution reaches
smp_prepare_cpus(), the secondary CPUs may become inaccessible.
This is only a problem with CFE, and then only until the secondary CPUs
are brought online. Ideally, there would be some hypothetical mechanism
we could use to indicate that this area of memory is sensitive only
during boot. But as there is none, and since it is such a small amount
of memory, it is easiest to reserve it unconditionally.
Therefore, add a /reserved-memory node to bcm4908.dtsi to protect the
first 4KiB of physical memory.
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Link: https://lore.kernel.org/r/20241005050155.61103-2-CFSworks@gmail.com
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
|
|
Remove properties which are both unused in the kernel and undocumented.
Most likely they are leftovers from downstream.
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/r/20241115193854.3624123-1-robh@kernel.org
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
|
|
The D-Step has some minor variations in the hardware, so needs
matching changes to DT.
Add a new DTS file that modifies the existing (C-step) devicetree.
Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.com>
Link: https://lore.kernel.org/r/20241025-drm-vc4-2712-support-v2-36-35efa83c8fc0@raspberrypi.com
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
|
|
Adds the HVS and associated hardware blocks to support the HDMI
and writeback connectors on BCM2712 / Pi5.
Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.com>
Link: https://lore.kernel.org/r/20241025-drm-vc4-2712-support-v2-35-35efa83c8fc0@raspberrypi.com
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
|
|
BCM2712 still uses the firmware clocks and power drivers, so add
them to the base device tree.
The brcm,bcm2836-l1-intc controller isn't used on this platform.
It is used on 32-bit kernels for the smp_boot_secondary hook, but
BCM2712 can't run a 32-bit kernel.
Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.com>
Link: https://lore.kernel.org/r/20241025-drm-vc4-2712-support-v2-34-35efa83c8fc0@raspberrypi.com
Link: https://lore.kernel.org/r/20241212-dt-bcm2712-fixes-v3-7-44a7f3390331@raspberrypi.com
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
|