Age | Commit message (Collapse) | Author |
|
Drop KVM's odd restriction that disallows clearing CPUID_FAULT in
MSR_PLATFORM_INFO if CPL>0 CPUID faulting is enabled in
MSR_MISC_FEATURES_ENABLES. KVM generally doesn't require specific
ordering when userspace sets MSRs, and the completely arbitrary order of
MSRs in emulated_msrs_all means that a userspace that uses KVM's list
verbatim could run afoul of the check.
Dropping the restriction obviously means that userspace could stuff a
nonsensical vCPU model, but that's the case all over KVM. KVM typically
restricts userspace MSR writes only when it makes things easier for KVM
and/or userspace.
Link: https://lore.kernel.org/r/20240802185511.305849-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Reject userspace accesses to ARCH_CAPABILITIES if the MSR isn't supposed
to exist, according to guest CPUID. However, "reject" accesses with
KVM_MSR_RET_UNSUPPORTED, so that reads get '0' and writes of '0' are
ignored if KVM advertised support ARCH_CAPABILITIES.
KVM's ABI is that userspace must set guest CPUID prior to setting MSRs,
and that setting MSRs that aren't supposed exist is disallowed (modulo the
'0' exemption).
Link: https://lore.kernel.org/r/20240802185511.305849-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the restriction that the PMU version is non-zero when handling writes
to PERF_CAPABILITIES now that KVM unconditionally checks for PDCM support.
Link: https://lore.kernel.org/r/20240802185511.305849-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Reject userspace accesses to PERF_CAPABILITIES if PDCM isn't set in guest
CPUID, i.e. if the vCPU doesn't actually have PERF_CAPABILITIES. But! Do
so via KVM_MSR_RET_UNSUPPORTED, so that reads get '0' and writes of '0'
are ignored if KVM advertised support PERF_CAPABILITIES.
KVM's ABI is that userspace must set guest CPUID prior to setting MSRs,
and that setting MSRs that aren't supposed exist is disallowed (modulo the
'0' exemption).
Link: https://lore.kernel.org/r/20240802185511.305849-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a quirk to control KVM's misguided initialization of select feature
MSRs to KVM's max configuration, as enabling features by default violates
KVM's approach of letting userspace own the vCPU model, and is actively
problematic for MSRs that are conditionally supported, as the vCPU will
end up with an MSR value that userspace can't restore. E.g. if the vCPU
is configured with PDCM=0, userspace will save and attempt to restore a
non-zero PERF_CAPABILITIES, thanks to KVM's meddling.
Link: https://lore.kernel.org/r/20240802185511.305849-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Tag MSR_PLATFORM_INFO as a feature MSR (because it is), i.e. disallow it
from being modified after the vCPU has run.
To make KVM's selftest compliant, simply delete the userspace MSR write
that restores KVM's original value at the end of the test. Verifying that
userspace can write back what it originally read is uninteresting in this
particular case, because KVM doesn't enforce _any_ bits in the MSR, i.e.
userspace should be able to write any arbitrary value.
Link: https://lore.kernel.org/r/20240802185511.305849-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Bunch all of the feature MSR initialization in kvm_arch_vcpu_create() so
that it can be easily quirked in a future patch.
No functional change intended.
Link: https://lore.kernel.org/r/20240802185511.305849-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Document a flaw in KVM's ABI which lets userspace attempt to inject a
"bad" hardware exception event, and thus induce VM-Fail on Intel CPUs.
Fixing the flaw is a fool's errand, as AMD doesn't sanity check the
validity of the error code, Intel CPUs that support CET relax the check
for Protected Mode, userspace can change the mode after queueing an
exception, KVM ignores the error code when emulating Real Mode exceptions,
and so on and so forth.
The VM-Fail itself doesn't harm KVM or the kernel beyond triggering a
ratelimited pr_warn(), so just document the oddity.
Link: https://lore.kernel.org/r/20240802200420.330769-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
HOST_RIP canonical check should check the L1 of CR4.LA57 stored in
the vmcs12 rather than the current L1's because it is legal to change
the CR4.LA57 value during VM exit from L2 to L1.
This is a theoretical bug though, because it is highly unlikely that a
VM exit will change the CR4.LA57 from the value it had on VM entry.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-5-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
As a result of a recent investigation, it was determined that x86 CPUs
which support 5-level paging, don't always respect CR4.LA57 when doing
canonical checks.
In particular:
1. MSRs which contain a linear address, allow full 57-bitcanonical address
regardless of CR4.LA57 state. For example: MSR_KERNEL_GS_BASE.
2. All hidden segment bases and GDT/IDT bases also behave like MSRs.
This means that full 57-bit canonical address can be loaded to them
regardless of CR4.LA57, both using MSRS (e.g GS_BASE) and instructions
(e.g LGDT).
3. TLB invalidation instructions also allow the user to use full 57-bit
address regardless of the CR4.LA57.
Finally, it must be noted that the CPU doesn't prevent the user from
disabling 5-level paging, even when the full 57-bit canonical address is
present in one of the registers mentioned above (e.g GDT base).
In fact, this can happen without any userspace help, when the CPU enters
SMM mode - some MSRs, for example MSR_KERNEL_GS_BASE are left to contain
a non-canonical address in regard to the new mode.
Since most of the affected MSRs and all segment bases can be read and
written freely by the guest without any KVM intervention, this patch makes
the emulator closely follow hardware behavior, which means that the
emulator doesn't take in the account the guest CPUID support for 5-level
paging, and only takes in the account the host CPU support.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-4-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add emulation flags for MSR accesses and Descriptor Tables loads, and pass
the new flags as appropriate to emul_is_noncanonical_address(). The flags
will be used to perform the correct canonical check, as the type of access
affects whether or not CR4.LA57 is consulted when determining the canonical
bit.
No functional change is intended.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-3-mlevitsk@redhat.com
[sean: split to separate patch, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add emulate_ops.is_canonical_addr() to perform (non-)canonical checks in
the emulator, which will allow extending is_noncanonical_address() to
support different flavors of canonical checks, e.g. for descriptor table
bases vs. MSRs, without needing duplicate logic in the emulator.
No functional change is intended.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-3-mlevitsk@redhat.com
[sean: separate from additional of flags, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop x86.h include from cpuid.h to allow the x86.h to include the cpuid.h
instead.
Also fix various places where x86.h was implicitly included via cpuid.h
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-2-mlevitsk@redhat.com
[sean: fixup a missed include in mtrr.c]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Explicitly return '0' for guest RIP when handling a PMI VM-Exit for a vCPU
with protected guest state, i.e. when KVM can't read the real RIP. While
there is no "right" value, and profiling a protect guest is rather futile,
returning the last known RIP is worse than returning obviously "bad" data.
E.g. for SEV-ES+, the last known RIP will often point somewhere in the
guest's boot flow.
Opportunistically add WARNs to effectively assert that the in_kernel() and
get_ip() callbacks are restricted to the common PMI handler, as the return
values for the protected guest state case are largely arbitrary, i.e. only
make any sense whatsoever for PMIs, where the returned values have no
functional impact and thus don't truly matter.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241009175002.1118178-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When lockdep is enabled, assert that KVM accesses the register caches if
and only if cache fills are guaranteed to consume fresh data, i.e. when
KVM when KVM is in control of the code sequence. Concretely, the caches
can only be used from task context (synchronous) or when handling a PMI
VM-Exit (asynchronous, but only in specific windows where the caches are
in a known, stable state).
Generally speaking, there are very few flows where reading register state
from an asynchronous context is correct or even necessary. So, rather
than trying to figure out a generic solution, simply disallow using the
caches outside of task context by default, and deal with any future
exceptions on a case-by-case basis _if_ they arise.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241009175002.1118178-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When querying guest CPL to determine if a vCPU was preempted while in
kernel mode, bypass the register cache, i.e. always read SS.AR_BYTES from
the VMCS on Intel CPUs. If the kernel is running with full preemption
enabled, using the register cache in the preemption path can result in
stale and/or uninitialized data being cached in the segment cache.
In particular the following scenario is currently possible:
- vCPU is just created, and the vCPU thread is preempted before
SS.AR_BYTES is written in vmx_vcpu_reset().
- When scheduling out the vCPU task, kvm_arch_vcpu_in_kernel() =>
vmx_get_cpl() reads and caches '0' for SS.AR_BYTES.
- vmx_vcpu_reset() => seg_setup() configures SS.AR_BYTES, but doesn't
invoke vmx_segment_cache_clear() to invalidate the cache.
As a result, KVM retains a stale value in the cache, which can be read,
e.g. via KVM_GET_SREGS. Usually this is not a problem because the VMX
segment cache is reset on each VM-Exit, but if the userspace VMM (e.g KVM
selftests) reads and writes system registers just after the vCPU was
created, _without_ modifying SS.AR_BYTES, userspace will write back the
stale '0' value and ultimately will trigger a VM-Entry failure due to
incorrect SS segment type.
Note, the VM-Enter failure can also be avoided by moving the call to
vmx_segment_cache_clear() until after the vmx_vcpu_reset() initializes all
segments. However, while that change is correct and desirable (and will
come along shortly), it does not address the underlying problem that
accessing KVM's register caches from !task context is generally unsafe.
In addition to fixing the immediate bug, bypassing the cache for this
particular case will allow hardening KVM register caching log to assert
that the caches are accessed only when KVM _knows_ it is safe to do so.
Fixes: de63ad4cf497 ("KVM: X86: implement the logic for spinlock optimization")
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Closes: https://lore.kernel.org/all/20240716022014.240960-3-mlevitsk@redhat.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241009175002.1118178-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
From Intel's documentation [1], "CPUID.(EAX=07H,ECX=0):EDX[26]
enumerates support for indirect branch restricted speculation (IBRS)
and the indirect branch predictor barrier (IBPB)." Further, from [2],
"Software that executed before the IBPB command cannot control the
predicted targets of indirect branches (4) executed after the command
on the same logical processor," where footnote 4 reads, "Note that
indirect branches include near call indirect, near jump indirect and
near return instructions. Because it includes near returns, it follows
that **RSB entries created before an IBPB command cannot control the
predicted targets of returns executed after the command on the same
logical processor.**" [emphasis mine]
On the other hand, AMD's IBPB "may not prevent return branch
predictions from being specified by pre-IBPB branch targets" [3].
However, some AMD processors have an "enhanced IBPB" [terminology
mine] which does clear the return address predictor. This feature is
enumerated by CPUID.80000008:EDX.IBPB_RET[bit 30] [4].
Adjust the cross-vendor features enumerated by KVM_GET_SUPPORTED_CPUID
accordingly.
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/cpuid-enumeration-and-architectural-msrs.html
[2] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/speculative-execution-side-channel-mitigations.html#Footnotes
[3] https://www.amd.com/en/resources/product-security/bulletin/amd-sb-1040.html
[4] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf
Fixes: 0c54914d0c52 ("KVM: x86: use Intel speculation bugs and features as derived in generic x86 code")
Suggested-by: Venkatesh Srinivas <venkateshs@chromium.org>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20241011214353.1625057-5-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
This is an inherent feature of IA32_PRED_CMD[0], so it is trivially
virtualizable (as long as IA32_PRED_CMD[0] is virtualized).
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20241011214353.1625057-4-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Wrap kvm_vcpu_exit_request()'s load of vcpu->mode with READ_ONCE() to
ensure the variable is re-loaded from memory, as there is no guarantee the
caller provides the necessary annotations to ensure KVM sees a fresh value,
e.g. the VM-Exit fastpath could theoretically reuse the pre-VM-Enter value.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20240828232013.768446-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Change svm_vcpu_run() to vcpu_enter_guest() in the comment of
__kvm_set_or_clear_apicv_inhibit() to make it reflect the fact.
When one thread updates VM's APICv state due to updating the APICv
inhibit reasons, it kicks off all vCPUs and makes them wait until the
new reason has been updated and can be seen by all vCPUs.
There was one WARN() to make sure VM's APICv state is consistent with
vCPU's APICv state in the svm_vcpu_run(). Commit ee49a8932971 ("KVM:
x86: Move SVM's APICv sanity check to common x86") moved that WARN() to
x86 common code vcpu_enter_guest() due to the logic is not unique to
SVM, and added comments to both __kvm_set_or_clear_apicv_inhibit() and
vcpu_enter_guest() to explain this.
However, although the comment in __kvm_set_or_clear_apicv_inhibit()
mentioned the WARN(), it seems forgot to reflect that the WARN() had
been moved to x86 common, i.e., it still mentioned the svm_vcpu_run()
but not vcpu_enter_guest(). Fix it.
Note after the change the first line that contains vcpu_enter_guest()
exceeds 80 characters, but leave it as is to make the diff clean.
Fixes: ee49a8932971 ("KVM: x86: Move SVM's APICv sanity check to common x86")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/e462e7001b8668649347f879c66597d3327dbac2.1728383775.git.kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The sentence "... so that KVM can the AVIC doorbell to ..." doesn't have
a verb. Fix it.
After adding the verb 'use', that line exceeds 80 characters. Thus wrap
the 'to' to the next line.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/666e991edf81e1fccfba9466f3fe65965fcba897.1728383775.git.kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into HEAD
More Qualcomm Arm64 DeviceTree fixes for v6.12
Bring a range of PCIe fixes across the X Elite platform, as well as
marking the NVMe power supply boot-on to avoid glitching the power
supply during boot.
The X Elite CRD audio configuration sees a spelling mistake corrected.
On SM8450 the PCIe 1 PIPE clock definition is corrected, to fix a
regression where this isn't able to acquire it's clocks.
* tag 'qcom-arm64-fixes-for-6.12-2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
arm64: dts: qcom: x1e80100: fix PCIe5 interconnect
arm64: dts: qcom: x1e80100: fix PCIe4 interconnect
arm64: dts: qcom: x1e80100: Fix up BAR spaces
arm64: dts: qcom: x1e80100-qcp: fix nvme regulator boot glitch
arm64: dts: qcom: x1e80100-microsoft-romulus: fix nvme regulator boot glitch
arm64: dts: qcom: x1e80100-yoga-slim7x: fix nvme regulator boot glitch
arm64: dts: qcom: x1e80100-vivobook-s15: fix nvme regulator boot glitch
arm64: dts: qcom: x1e80100-crd: fix nvme regulator boot glitch
arm64: dts: qcom: x1e78100-t14s: fix nvme regulator boot glitch
arm64: dts: qcom: x1e80100-crd Rename "Twitter" to "Tweeter"
arm64: dts: qcom: x1e80100: Fix PCIe 6a lanes description
arm64: dts: qcom: sm8450 fix PIPE clock specification for pcie1
arm64: dts: qcom: x1e80100: Add Broadcast_AND region in LLCC block
arm64: dts: qcom: x1e80100: fix PCIe5 PHY clocks
arm64: dts: qcom: x1e80100: fix PCIe4 and PCIe6a PHY clocks
Link: https://lore.kernel.org/r/20241101143206.738617-1-andersson@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into HEAD
Qualcomm Arm64 DeviceTree fix for v6.12
This reverts the conversion to use the mailbox binding for RPM IPC
interrupts, as this broke boot on msm8939.
* tag 'qcom-arm64-fixes-for-6.12' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
arm64: dts: qcom: msm8939: revert use of APCS mbox for RPM
Link: https://lore.kernel.org/r/20241101142414.737828-1-andersson@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into HEAD
Arm SCMI fixes for v6.12(part 2)
Couple of fixes to address slab-use-after-free in scmi_bus_notifier()
via scmi_dev->name and possible incorrect clear channel transport
operation on A2P channel if some sort of P2A only messages are initiated
on A2P channel(occurs when stress tested passing /dev/random to the
channel).
Apart from this, there are fixes to address missing "arm" prefix in the
recently added property max-rx-timeout-ms which was missed in the review
but was identified when further additions to the same binding were
getting reviewed.
* tag 'scmi-fixes-6.12-2' of https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
firmware: arm_scmi: Use vendor string in max-rx-timeout-ms
dt-bindings: firmware: arm,scmi: Add missing vendor string
firmware: arm_scmi: Reject clear channel request on A2P
firmware: arm_scmi: Fix slab-use-after-free in scmi_bus_notifier()
Link: https://lore.kernel.org/r/20241031172734.3109140-1-sudeep.holla@arm.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux into HEAD
RISC-V soc fixes for v6.12-rc6
StarFive:
Two minor dts fixes, one setting the correct eth phy delay parameters
and one disabling unused nodes that caused warnings at probe time.
Firmware:
Fix the poll_complete() implementation in the auto-update driver so that
it behaves as the framework expects.
Misc:
Update the maintainer pattern for my dts entry, so that it covers
the specific platforms listed , rather than including all riscv
platforms with the list platforms excluded.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
* tag 'riscv-soc-fixes-for-v6.12-rc6' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux:
MAINTAINERS: invert Misc RISC-V SoC Support's pattern
riscv: dts: starfive: Update ethernet phy0 delay parameter values for Star64
riscv: dts: starfive: disable unused csi/camss nodes
firmware: microchip: auto-update: fix poll_complete() to not report spurious timeout errors
Link: https://lore.kernel.org/r/20241031-colossal-cassette-617817c9bec3@spud
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Currently, RK809's BUCK3 regulator is modelled in the driver as a
configurable regulator with 0.5-2.4V voltage range. But the voltage
setting is not actually applied, because when bit 6 of
PMIC_POWER_CONFIG register is set to 0 (default), BUCK3 output voltage
is determined by the external feedback resistor. Fix this, by setting
bit 6 when voltage selection is set. Existing users which do not
specify voltage constraints in their device trees will not be affected
by this change, since no voltage setting is applied in those cases,
and bit 6 is not enabled.
Signed-off-by: Mikhail Rudenko <mike.rudenko@gmail.com>
Link: https://patch.msgid.link/20241017-rk809-dcdc3-v1-1-e3c3de92f39c@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into HEAD
A number of DTS correctnes fixes, to bring down the amount of errors
reported by dtbscheck.
* tag 'v6.12-rockchip-dtsfixes1' of https://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip: (23 commits)
arm64: dts: rockchip: Correct GPIO polarity on brcm BT nodes
arm64: dts: rockchip: Drop invalid clock-names from es8388 codec nodes
ARM: dts: rockchip: Fix the realtek audio codec on rk3036-kylin
ARM: dts: rockchip: Fix the spi controller on rk3036
ARM: dts: rockchip: drop grf reference from rk3036 hdmi
ARM: dts: rockchip: fix rk3036 acodec node
arm64: dts: rockchip: remove orphaned pinctrl-names from pinephone pro
arm64: dts: rockchip: remove num-slots property from rk3328-nanopi-r2s-plus
arm64: dts: rockchip: Fix LED triggers on rk3308-roc-cc
arm64: dts: rockchip: Remove #cooling-cells from fan on Theobroma lion
arm64: dts: rockchip: Remove undocumented supports-emmc property
arm64: dts: rockchip: Fix bluetooth properties on Rock960 boards
arm64: dts: rockchip: Fix bluetooth properties on rk3566 box demo
arm64: dts: rockchip: Drop regulator-init-microvolt from two boards
arm64: dts: rockchip: fix i2c2 pinctrl-names property on anbernic-rg353p/v
arm64: dts: rockchip: Fix reset-gpios property on brcm BT nodes
arm64: dts: rockchip: Fix wakeup prop names on PineNote BT node
arm64: dts: rockchip: Remove hdmi's 2nd interrupt on rk3328
arm64: dts: rockchip: Designate Turing RK1's system power controller
arm64: dts: rockchip: Start cooling maps numbering from zero on ROCK 5B
...
Link: https://lore.kernel.org/r/2847150.mvXUDI8C0e@phil
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://github.com/sophgo/linux into HEAD
RISC-V Sophgo Devicetree fixes for v6.12-rc1
Just one minor fix to replace deprecated "snps,nr-gpios"
property with "ngpios" for snps,dw-apb-gpio-port devices.
Signed-off-by: Chen Wang <unicorn_wang@outlook.com>
* tag 'riscv-sophgo-dt-fixes-for-v6.12-rc1' of https://github.com/sophgo/linux:
riscv: dts: Replace deprecated snps,nr-gpios property for snps,dw-apb-gpio-port devices
Link: https://lore.kernel.org/r/MA0P287MB2822A17623C51A558DB948FCFE482@MA0P287MB2822.INDP287.PROD.OUTLOOK.COM
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into HEAD
i.MX fixes for 6.12:
- An imx8qm change from Alexander Stein to fix VPU IRQs
- An imx8 LVDS subsystem change from Diogo Silva to fix clock-output-names
- An imx8ulp change from Haibo Chen to correct flexspi compatible string
- An imx8mp-skov board change from Liu Ying to set correct clock rate
for media_isp
- An imx8mp-phyboard change from Marek Vasut to correct Video PLL1 frequency
- An imx8mp change from Peng Fan to correct SDHC IPG clock
* tag 'imx-fixes-6.12' of https://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
arm64: dts: imx8mp-phyboard-pollux: Set Video PLL1 frequency to 506.8 MHz
arm64: dts: imx8mp: correct sdhc ipg clk
arm64: dts: imx8mp-skov-revb-mi1010ait-1cp1: Assign "media_isp" clock rate
arm64: dts: imx8: Fix lvds0 device tree
arm64: dts: imx8ulp: correct the flexspi compatible string
arm64: dts: imx8-ss-vpu: Fix imx8qm VPU IRQs
Link: https://lore.kernel.org/r/ZxhsnnLudN2kD2Po@dragon
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Add a selftest to check that the tracefs gid mount option is applied
correctly.
./ftracetest test.d/00basic/mount_options.tc
Use the new readme string "[gid=<gid>] as a requirement and also update
test_ownership.tc requirements to use this.
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Ali Zahraee <ahzahraee@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/20241030171928.4168869-4-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Commit ee7f3666995d ("tracefs: Have new files inherit the ownership of
their parent") and commit 48b27b6b5191 ("tracefs: Set all files to the
same group ownership as the mount option") introduced a new gid mount
option that allows specifying a group to apply to all entries in tracefs.
Document this in the tracing readme.
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Ali Zahraee <ahzahraee@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/20241030171928.4168869-3-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Commit 78ff64081949 ("vfs: Convert tracefs to use the new mount API")
converted tracefs to use the new mount APIs caused mount options
(e.g. gid=<gid>) to not take effect.
The tracefs superblock can be updated from multiple paths:
- on fs_initcall() to init_trace_printk_function_export()
- from a work queue to initialize eventfs
tracer_init_tracefs_work_func()
- fsconfig() syscall to mount or remount of tracefs
The tracefs superblock root inode gets created early on in
init_trace_printk_function_export().
With the new mount API, tracefs effectively uses get_tree_single() instead
of the old API mount_single().
Previously, mount_single() ensured that the options are always applied to
the superblock root inode:
(1) If the root inode didn't exist, call fill_super() to create it
and apply the options.
(2) If the root inode exists, call reconfigure_single() which
effectively calls tracefs_apply_options() to parse and apply
options to the subperblock's fs_info and inode and remount
eventfs (if necessary)
On the other hand, get_tree_single() effectively calls vfs_get_super()
which:
(3) If the root inode doesn't exists, calls fill_super() to create it
and apply the options.
(4) If the root inode already exists, updates the fs_context root
with the superblock's root inode.
(4) above is always the case for tracefs mounts, since the super block's
root inode will already be created by init_trace_printk_function_export().
This means that the mount options get ignored:
- Since it isn't applied to the superblock's root inode, it doesn't
get inherited by the children.
- Since eventfs is initialized from a separate work queue and
before call to mount with the options, and it doesn't get remounted
for mount.
Ensure that the mount options are applied to the super block and eventfs
is remounted to respect the mount options.
To understand this better, if fstab has the following:
tracefs /sys/kernel/tracing tracefs nosuid,nodev,noexec,gid=tracing 0 0
On boot up, permissions look like:
# ls -l /sys/kernel/tracing/trace
-rw-r----- 1 root root 0 Nov 1 08:37 /sys/kernel/tracing/trace
When it should look like:
# ls -l /sys/kernel/tracing/trace
-rw-r----- 1 root tracing 0 Nov 1 08:37 /sys/kernel/tracing/trace
Link: https://lore.kernel.org/r/536e99d3-345c-448b-adee-a21389d7ab4b@redhat.com/
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Ali Zahraee <ahzahraee@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 78ff64081949 ("vfs: Convert tracefs to use the new mount API")
Link: https://lore.kernel.org/20241030171928.4168869-2-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries") a mmap() of anonymous memory without a specific address hint
and of at least PMD_SIZE will be aligned to PMD so that it can benefit
from a THP backing page.
However this change has been shown to regress some workloads
significantly. [1] reports regressions in various spec benchmarks, with
up to 600% slowdown of the cactusBSSN benchmark on some platforms. The
benchmark seems to create many mappings of 4632kB, which would have merged
to a large THP-backed area before commit efa7df3e3bb5 and now they are
fragmented to multiple areas each aligned to PMD boundary with gaps
between. The regression then seems to be caused mainly due to the
benchmark's memory access pattern suffering from TLB or cache aliasing due
to the aligned boundaries of the individual areas.
Another known regression bisected to commit efa7df3e3bb5 is darktable [2]
[3] and early testing suggests this patch fixes the regression there as
well.
To fix the regression but still try to benefit from THP-friendly anonymous
mapping alignment, add a condition that the size of the mapping must be a
multiple of PMD size instead of at least PMD size. In case of many
odd-sized mapping like the cactusBSSN creates, those will stop being
aligned and with gaps between, and instead naturally merge again.
Link: https://lkml.kernel.org/r/20241024151228.101841-2-vbabka@suse.cz
Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Michael Matz <matz@suse.de>
Debugged-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1]
Reported-by: Matthias Bodenbinder <matthias@bodenbinder.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2]
Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.info/ [3]
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Yang Shi <yang@os.amperecomputing.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Petr Tesarik <ptesarik@suse.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A memleak was found as below:
unreferenced object 0xffff8881010d2a80 (size 32):
comm "mkdir", pid 1559, jiffies 4294932666
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 @...............
backtrace (crc 2e7ef6fa):
[<ffffffff81372754>] __kmalloc_node_noprof+0x394/0x470
[<ffffffff813024ab>] alloc_shrinker_info+0x7b/0x1a0
[<ffffffff813b526a>] mem_cgroup_css_online+0x11a/0x3b0
[<ffffffff81198dd9>] online_css+0x29/0xa0
[<ffffffff811a243d>] cgroup_apply_control_enable+0x20d/0x360
[<ffffffff811a5728>] cgroup_mkdir+0x168/0x5f0
[<ffffffff8148543e>] kernfs_iop_mkdir+0x5e/0x90
[<ffffffff813dbb24>] vfs_mkdir+0x144/0x220
[<ffffffff813e1c97>] do_mkdirat+0x87/0x130
[<ffffffff813e1de9>] __x64_sys_mkdir+0x49/0x70
[<ffffffff81f8c928>] do_syscall_64+0x68/0x140
[<ffffffff8200012f>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
alloc_shrinker_info(), when shrinker_unit_alloc() returns an errer, the
info won't be freed. Just fix it.
Link: https://lkml.kernel.org/r/20241025060942.1049263-1-chenridong@huaweicloud.com
Fixes: 307bececcd12 ("mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Wang Weiyang <wangweiyang2@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Update e-mail address.
Link: https://lkml.kernel.org/r/20241025085848.483149-1-eugen.hristev@linaro.org
Signed-off-by: Eugen Hristev <eugen.hristev@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When numa balancing is enabled with demotion, vmscan will call
migrate_pages when shrinking LRUs. migrate_pages will decrement the
the node's isolated page count, leading to an imbalanced count when
invoked from (MG)LRU code.
The result is dmesg output like such:
$ cat /proc/sys/vm/stat_refresh
[77383.088417] vmstat_refresh: nr_isolated_anon -103212
[77383.088417] vmstat_refresh: nr_isolated_file -899642
This negative value may impact compaction and reclaim throttling.
The following path produces the decrement:
shrink_folio_list
demote_folio_list
migrate_pages
migrate_pages_batch
migrate_folio_move
migrate_folio_done
mod_node_page_state(-ve) <- decrement
This path happens for SUCCESSFUL migrations, not failures. Typically
callers to migrate_pages are required to handle putback/accounting for
failures, but this is already handled in the shrink code.
When accounting for migrations, instead do not decrement the count when
the migration reason is MR_DEMOTION. As of v6.11, this demotion logic
is the only source of MR_DEMOTION.
Link: https://lkml.kernel.org/r/20241025141724.17927-1-gourry@gourry.net
Fixes: 26aa2d199d6f ("mm/migrate: demote pages during reclaim")
Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Wei Xu <weixugc@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Remove my previous work email, and the new one. The previous was never
used in the commit log, so there's no good reason to spare it.
Link: https://lkml.kernel.org/r/20241025181530.6151-1-jarkko@kernel.org
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Alex Elder <elder@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geliang Tang <geliang@kernel.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Cc: Matt Ranostay <matt@ranostay.sg>
Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Cc: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pull rdma fixes from Jason Gunthorpe:
- Put the QP netlink dump back in cxgb4, fixes a user visible
regression
- Don't change the rounding style in mlx5 for user provided rd_atomic
values
- Resolve a race in bnxt_re around the qp-handle table array
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/bnxt_re: synchronize the qp-handle table array
RDMA/bnxt_re: Fix the usage of control path spin locks
RDMA/mlx5: Round max_rd_atomic/max_dest_rd_atomic up instead of down
RDMA/cxgb4: Dump vendor specific QP details
|
|
Sparse warns:
note: in included file (through ../include/trace/trace_events.h,
../include/trace/define_trace.h,
../drivers/net/ethernet/freescale/dpaa/dpaa_eth_trace.h):
warning: incorrect type in assignment (different base types)
expected unsigned int [usertype] fd_status
got restricted __be32 const [usertype] status
We take struct qm_fd :: status, store it and print it as an u32,
though it is a big endian field. We should print the FD status in
CPU endianness for ease of debug and consistency between PowerPC and
Arm systems.
Though it is a not often used debug feature, it is best to treat it as
a bug and backport the format change to all supported stable kernels,
for consistency.
Fixes: eb11ddf36eb8 ("dpaa_eth: add trace points")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Link: https://patch.msgid.link/20241029163105.44135-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The MAC address of VF can be configured through the mailbox mechanism of
ENETC, but the previous implementation forgot to set the MAC address in
net_device, resulting in the SMAC of the sent frames still being the old
MAC address. Since the MAC address in the hardware has been changed, Rx
cannot receive frames with the DMAC address as the new MAC address. The
most obvious phenomenon is that after changing the MAC address, we can
see that the MAC address of eno0vf0 has not changed through the "ifconfig
eno0vf0" command and the IP address cannot be obtained .
root@ls1028ardb:~# ifconfig eno0vf0 down
root@ls1028ardb:~# ifconfig eno0vf0 hw ether 00:04:9f:3a:4d:56 up
root@ls1028ardb:~# ifconfig eno0vf0
eno0vf0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 66:36:2c:3b:87:76 txqueuelen 1000 (Ethernet)
RX packets 794 bytes 69239 (69.2 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 11 bytes 2226 (2.2 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Fixes: beb74ac878c8 ("enetc: Add vf to pf messaging support")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20241029090406.841836-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pull bpf fixes from Daniel Borkmann:
- Fix BPF verifier to force a checkpoint when the program's jump
history becomes too long (Eduard Zingerman)
- Add several fixes to the BPF bits iterator addressing issues like
memory leaks and overflow problems (Hou Tao)
- Fix an out-of-bounds write in trie_get_next_key (Byeonguk Jeong)
- Fix BPF test infra's LIVE_FRAME frame update after a page has been
recycled (Toke Høiland-Jørgensen)
- Fix BPF verifier and undo the 40-bytes extra stack space for
bpf_fastcall patterns due to various bugs (Eduard Zingerman)
- Fix a BPF sockmap race condition which could trigger a NULL pointer
dereference in sock_map_link_update_prog (Cong Wang)
- Fix tcp_bpf_recvmsg_parser to retrieve seq_copied from tcp_sk under
the socket lock (Jiayuan Chen)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, test_run: Fix LIVE_FRAME frame update after a page has been recycled
selftests/bpf: Add three test cases for bits_iter
bpf: Use __u64 to save the bits in bits iterator
bpf: Check the validity of nr_words in bpf_iter_bits_new()
bpf: Add bpf_mem_alloc_check_size() helper
bpf: Free dynamically allocated bits in bpf_iter_bits_destroy()
bpf: disallow 40-bytes extra stack for bpf_fastcall patterns
selftests/bpf: Add test for trie_get_next_key()
bpf: Fix out-of-bounds write in trie_get_next_key()
selftests/bpf: Test with a very short loop
bpf: Force checkpoint when jmp history is too long
bpf: fix filed access without lock
sock_map: fix a NULL pointer dereference in sock_map_link_update_prog()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from WiFi, bluetooth and netfilter.
No known new regressions outstanding.
Current release - regressions:
- wifi: mt76: do not increase mcu skb refcount if retry is not
supported
Current release - new code bugs:
- wifi:
- rtw88: fix the RX aggregation in USB 3 mode
- mac80211: fix memory corruption bug in struct ieee80211_chanctx
Previous releases - regressions:
- sched:
- stop qdisc_tree_reduce_backlog on TC_H_ROOT
- sch_api: fix xa_insert() error path in tcf_block_get_ext()
- wifi:
- revert "wifi: iwlwifi: remove retry loops in start"
- cfg80211: clear wdev->cqm_config pointer on free
- netfilter: fix potential crash in nf_send_reset6()
- ip_tunnel: fix suspicious RCU usage warning in ip_tunnel_find()
- bluetooth: fix null-ptr-deref in hci_read_supported_codecs
- eth: mlxsw: add missing verification before pushing Tx header
- eth: hns3: fixed hclge_fetch_pf_reg accesses bar space out of
bounds issue
Previous releases - always broken:
- wifi: mac80211: do not pass a stopped vif to the driver in
.get_txpower
- netfilter: sanitize offset and length before calling skb_checksum()
- core:
- fix crash when config small gso_max_size/gso_ipv4_max_size
- skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension
- mptcp: protect sched with rcu_read_lock
- eth: ice: fix crash on probe for DPLL enabled E810 LOM
- eth: macsec: fix use-after-free while sending the offloading packet
- eth: stmmac: fix unbalanced DMA map/unmap for non-paged SKB data
- eth: hns3: fix kernel crash when 1588 is sent on HIP08 devices
- eth: mtk_wed: fix path of MT7988 WO firmware"
* tag 'net-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
net: hns3: fix kernel crash when 1588 is sent on HIP08 devices
net: hns3: fixed hclge_fetch_pf_reg accesses bar space out of bounds issue
net: hns3: initialize reset_timer before hclgevf_misc_irq_init()
net: hns3: don't auto enable misc vector
net: hns3: Resolved the issue that the debugfs query result is inconsistent.
net: hns3: fix missing features due to dev->features configuration too early
net: hns3: fixed reset failure issues caused by the incorrect reset type
net: hns3: add sync command to sync io-pgtable
net: hns3: default enable tx bounce buffer when smmu enabled
netfilter: nft_payload: sanitize offset and length before calling skb_checksum()
net: ethernet: mtk_wed: fix path of MT7988 WO firmware
selftests: forwarding: Add IPv6 GRE remote change tests
mlxsw: spectrum_ipip: Fix memory leak when changing remote IPv6 address
mlxsw: pci: Sync Rx buffers for device
mlxsw: pci: Sync Rx buffers for CPU
mlxsw: spectrum_ptp: Add missing verification before pushing Tx header
net: skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension
Bluetooth: hci: fix null-ptr-deref in hci_read_supported_codecs
netfilter: nf_reject_ipv6: fix potential crash in nf_send_reset6()
netfilter: Fix use-after-free in get_info()
...
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/chunkuang.hu/linux into drm-fixes
Mediatek DRM Fixes - 20241028
1. Fix degradation problem of alpha blending
2. Fix color format MACROs in OVL
3. Fix get efuse issue for MT8188 DPTX
4. Fix potential NULL dereference in mtk_crtc_destroy()
5. Correct dpi power-domains property
6. Add split subschema property constraints
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Chun-Kuang Hu <chunkuang.hu@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20241028135846.3570-1-chunkuang.hu@kernel.org
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.12-2024-10-31:
amdgpu:
- DCN 3.5 fix
- Vangogh SMU KASAN fix
- SMU 13 profile reporting fix
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241031151539.3523633-1-alexander.deucher@amd.com
|
|
Improper use of userspace_irqchip_in_use led to syzbot hitting the
following WARN_ON() in kvm_timer_update_irq():
WARNING: CPU: 0 PID: 3281 at arch/arm64/kvm/arch_timer.c:459
kvm_timer_update_irq+0x21c/0x394
Call trace:
kvm_timer_update_irq+0x21c/0x394 arch/arm64/kvm/arch_timer.c:459
kvm_timer_vcpu_reset+0x158/0x684 arch/arm64/kvm/arch_timer.c:968
kvm_reset_vcpu+0x3b4/0x560 arch/arm64/kvm/reset.c:264
kvm_vcpu_set_target arch/arm64/kvm/arm.c:1553 [inline]
kvm_arch_vcpu_ioctl_vcpu_init arch/arm64/kvm/arm.c:1573 [inline]
kvm_arch_vcpu_ioctl+0x112c/0x1b3c arch/arm64/kvm/arm.c:1695
kvm_vcpu_ioctl+0x4ec/0xf74 virt/kvm/kvm_main.c:4658
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:907 [inline]
__se_sys_ioctl fs/ioctl.c:893 [inline]
__arm64_sys_ioctl+0x108/0x184 fs/ioctl.c:893
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x78/0x1b8 arch/arm64/kernel/syscall.c:49
el0_svc_common+0xe8/0x1b0 arch/arm64/kernel/syscall.c:132
do_el0_svc+0x40/0x50 arch/arm64/kernel/syscall.c:151
el0_svc+0x54/0x14c arch/arm64/kernel/entry-common.c:712
el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
The following sequence led to the scenario:
- Userspace creates a VM and a vCPU.
- The vCPU is initialized with KVM_ARM_VCPU_PMU_V3 during
KVM_ARM_VCPU_INIT.
- Without any other setup, such as vGIC or vPMU, userspace issues
KVM_RUN on the vCPU. Since the vPMU is requested, but not setup,
kvm_arm_pmu_v3_enable() fails in kvm_arch_vcpu_run_pid_change().
As a result, KVM_RUN returns after enabling the timer, but before
incrementing 'userspace_irqchip_in_use':
kvm_arch_vcpu_run_pid_change()
ret = kvm_arm_pmu_v3_enable()
if (!vcpu->arch.pmu.created)
return -EINVAL;
if (ret)
return ret;
[...]
if (!irqchip_in_kernel(kvm))
static_branch_inc(&userspace_irqchip_in_use);
- Userspace ignores the error and issues KVM_ARM_VCPU_INIT again.
Since the timer is already enabled, control moves through the
following flow, ultimately hitting the WARN_ON():
kvm_timer_vcpu_reset()
if (timer->enabled)
kvm_timer_update_irq()
if (!userspace_irqchip())
ret = kvm_vgic_inject_irq()
ret = vgic_lazy_init()
if (unlikely(!vgic_initialized(kvm)))
if (kvm->arch.vgic.vgic_model !=
KVM_DEV_TYPE_ARM_VGIC_V2)
return -EBUSY;
WARN_ON(ret);
Theoretically, since userspace_irqchip_in_use's functionality can be
simply replaced by '!irqchip_in_kernel()', get rid of the static key
to avoid the mismanagement, which also helps with the syzbot issue.
Cc: <stable@vger.kernel.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
Short summary of fixes pull:
ivpu:
- Fix firewall IRQ handling
panthor:
- Fix firmware initialization wrt page sizes
- Fix handling and reporting of dead job groups
sched:
- Guarantee forward progress via WC_MEM_RECLAIM
tests:
- Fix memory leak in drm_display_mode_from_cea_vic()
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20241031144348.GA7826@linux-2.fritz.box
|
|
Start reprogramming PMU events at nested boundaries now that everything
is in place to handle the EL2 event filter. Only repaint events where
the filter differs between EL1 and EL2 as a slight optimization.
PMU now 'works' for nested VMs, albeit slow.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182559.3364829-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
It hopefully comes as no surprise when I say that vEL2 actually runs at
EL1. So, the guest hypervisor's EL2 event filter (NSH) needs to actually
be applied to EL1 in the perf event. In addition to this, the disable
bit for the guest counter range (HPMD) needs to have the effect of
stopping the affected counters.
Do exactly that by stuffing ::exclude_kernel with the combined effect of
these controls. This isn't quite enough yet, as the backing perf events
need to be reprogrammed upon nested ERET/exception entry to remap the
effective filter onto ::exclude_kernel.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-18-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Counters that fall in the hypervisor range (i.e. N >= HPMN) have a
separate control for enabling 64 bit overflow. Take it into account.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-17-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
When the PMU is configured with split counter ranges, HPME becomes the
enable bit for the counters reserved for EL2.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-16-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|