Some libraries want to ensure they are single threaded before forking,
so making the kernel's kvm huge page recovery process a vhost task of
the user process breaks those. The minijail library used by crosvm is
one such affected application.
Defer the task to after the first VM_RUN call, which occurs after the
parent process has forked all its jailed processes. This needs to happen
only once for the kvm instance, so introduce some general-purpose
infrastructure for that, too. It's similar in concept to pthread_once,
except that it is actually usable because the callback takes a parameter.
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250123153543.2769928-1-kbusch@meta.com>
[Move call_once API to include/linux. - Paolo]
Cc: stable@vger.kernel.org
Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
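The "run once, with a parameter" idea can be sketched in userspace C as
follows. This is only an illustration under stated assumptions: struct once,
ONCE_INIT, call_once_arg() and bump() are hypothetical names, and the code
is not the kernel's call_once API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace analogue; not the kernel's call_once API. */
struct once {
	pthread_mutex_t lock;
	bool done;
};

#define ONCE_INIT { PTHREAD_MUTEX_INITIALIZER, false }

/*
 * Run cb(arg) at most once per 'struct once'. Callers that arrive while
 * the callback is running block on the mutex and return after it is done.
 */
static void call_once_arg(struct once *o, void (*cb)(void *), void *arg)
{
	assert(o && cb);
	pthread_mutex_lock(&o->lock);
	if (!o->done) {
		cb(arg);
		o->done = true;
	}
	pthread_mutex_unlock(&o->lock);
}

/* Example callback: counts how many times it was really invoked. */
static void bump(void *arg)
{
	++*(int *)arg;
}
```

Unlike pthread_once(), the callback receives a caller-supplied pointer, so
per-instance state (here a counter, in KVM the kvm instance) can be passed
without resorting to globals.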
|
|
As part of enabling TDX virtual machines, support separation of
private/shared EPT into separate roots.
Confidential computing solutions almost invariably have concepts of
private and shared memory, but they may differ a lot in the details.
In SEV, for example, the bit is handled more like a permission bit as
far as the page tables are concerned: the private/shared bit is not
included in the physical address.
For TDX, instead, the bit is more like a physical address bit, with
the host mapping private memory in one half of the address space and
shared in another. Furthermore, the two halves are mapped by different
EPT roots and only the shared half is managed by KVM; the private half
(also called Secure EPT in Intel documentation) gets managed by the
privileged TDX Module via SEAMCALLs.
As a result, the operations that actually change the private half of
the EPT are limited and relatively slow compared to reading a PTE. For
this reason the design for KVM is to keep a mirror of the private EPT in
host memory. This allows KVM to quickly walk the EPT and only perform the
slower private EPT operations when it needs to actually modify mid-level
private PTEs.
There are thus three sets of EPT page tables: external, mirror and
direct. In the case of TDX (the only user of this framework) the
first two cover private memory, whereas the third manages shared
memory:
external EPT - Hidden within the TDX module, modified via TDX module
calls.
mirror EPT - Bookkeeping tree used as an optimization by KVM, not
used by the processor.
direct EPT - Normal EPT that maps unencrypted shared memory.
Managed like the EPT of a normal VM.
Modifying external EPT
----------------------
Modifications to the mirrored page tables need to also perform the
same operations to the private page tables, which will be handled via
kvm_x86_ops. Although this prep series does not interact with the TDX
module at all to actually configure the private EPT, it does lay the
groundwork for doing this.
In some ways updating the private EPT is as simple as plumbing PTE
modifications through to also call into the TDX module; however, the
locking is more complicated because inserting a single PTE can no longer
be done atomically with a single CMPXCHG. For this reason, the existing
FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the
private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides
protecting operation of KVM, it limits the set of cases in which the
TDX module will encounter contention on its own PTE locks.
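The freeze, update, publish sequence described above can be sketched with
C11 atomics. This is a minimal illustration, not the KVM code: FROZEN_PTE,
external_update_fn and set_pte_with_external() are hypothetical names, and
the sentinel value does not reproduce the real FROZEN_SPTE encoding.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sentinel; the real FROZEN_SPTE encoding is not reproduced. */
#define FROZEN_PTE UINT64_MAX

/* Hypothetical stand-in for the slow external update; returns 0 on success. */
typedef int (*external_update_fn)(uint64_t new_pte);

/*
 * Install new_pte over old_pte when an external table must be updated too.
 * Returns false if the PTE no longer holds old_pte (changed or frozen by
 * another thread, so the caller would retry) or if the external update fails.
 */
static bool set_pte_with_external(_Atomic uint64_t *ptep, uint64_t old_pte,
				  uint64_t new_pte, external_update_fn update)
{
	assert(old_pte != FROZEN_PTE && new_pte != FROZEN_PTE);

	/* Step 1: take the per-PTE "lock" by swapping in the frozen marker. */
	if (!atomic_compare_exchange_strong(ptep, &old_pte, FROZEN_PTE))
		return false;

	/* Step 2: slow external update while other threads see FROZEN_PTE. */
	if (update(new_pte) != 0) {
		atomic_store(ptep, old_pte); /* roll back and unfreeze */
		return false;
	}

	/* Step 3: publish the new value, which also releases the "lock". */
	atomic_store(ptep, new_pte);
	return true;
}

static int update_ok(uint64_t new_pte) { (void)new_pte; return 0; }
static int update_fail(uint64_t new_pte) { (void)new_pte; return -1; }
```

Only the thread that wins the compare-and-exchange ever calls into the
external update for a given PTE, which is how freezing limits contention
on the other side.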
Zapping external EPT
--------------------
While the framework tries to be relatively generic, and to be
understandable without knowing TDX much in detail, some requirements of
TDX sometimes leak; for example the private page tables also cannot be
zapped while the range has anything mapped, so the mirrored/private page
tables need to be protected from KVM operations that zap any non-leaf
PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().
For normal VMs, guest memory is zapped for several reasons: user
memory getting paged out by the guest, memslots getting deleted,
passthrough of devices with non-coherent DMA. Confidential computing
adds to these the conversion of memory between shared and private. These
operations must not zap any private memory that is in use by the guest.
This is possible because the only zapping that is out of the control
of KVM/userspace is paging out userspace memory, which cannot apply to
guest_memfd operations. Thus a TDX VM will only zap private memory from
memslot deletion and from conversion between private and shared memory
which is triggered by the guest.
To avoid zapping too much memory, enums are introduced so that operations
can choose to target only private or shared memory, and thus only
direct or mirror EPT. For example:
Memslot deletion - Private and shared
MMU notifier based zapping - Shared only
Conversion to shared - Private only
Conversion to private - Shared only
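The table above can be read as a mapping from zap reason to target mask.
A small sketch, assuming hypothetical enum and function names (the enums
introduced by the series may be named and shaped differently):

```c
#include <assert.h>

/* Hypothetical names; the enums added by the series may differ. */
enum zap_target {
	ZAP_SHARED  = 1 << 0, /* direct EPT */
	ZAP_PRIVATE = 1 << 1, /* mirror EPT, and through it the external EPT */
};

enum zap_reason {
	REASON_MEMSLOT_DELETION,
	REASON_MMU_NOTIFIER,
	REASON_CONVERT_TO_SHARED,
	REASON_CONVERT_TO_PRIVATE,
};

/* Encode the table: which half of the address space each operation zaps. */
static unsigned int zap_targets(enum zap_reason reason)
{
	switch (reason) {
	case REASON_MEMSLOT_DELETION:
		return ZAP_SHARED | ZAP_PRIVATE;
	case REASON_MMU_NOTIFIER:
		return ZAP_SHARED;
	case REASON_CONVERT_TO_SHARED:
		return ZAP_PRIVATE; /* the old private mapping goes away */
	case REASON_CONVERT_TO_PRIVATE:
		return ZAP_SHARED;  /* the old shared mapping goes away */
	}
	assert(!"unknown zap reason");
	return 0;
}
```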
Other cases of zapping will not be supported for KVM, for example
APICv update or non-coherent DMA status update; for the latter, TDX will
simply require that the CPU supports self-snoop and honor guest PAT
unconditionally for shared memory.
|
|
Make the completion of hypercalls go through the complete_hypercall
function pointer argument, no matter if the hypercall exits to
userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE
specifically went to userspace, and all the others did not; the new code
need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
all whether there was an exit to userspace or not.
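The shape of that change can be sketched as below. All names (struct vcpu,
do_hypercall(), resume_from_user(), the needs_user flag) are hypothetical
and the return-value convention is an assumption made for the example; the
point is only that one completion callback serves both paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vcpu {
	long ret;                          /* value returned to the guest */
	int (*complete_io)(struct vcpu *); /* pending completion, if any */
};

/* Hypothetical completion; returns 1, meaning "resume the guest". */
static int complete_hypercall(struct vcpu *v)
{
	v->ret += 100;
	return 1;
}

/* needs_user selects whether userspace services the call first. */
static int do_hypercall(struct vcpu *v, bool needs_user,
			int (*complete)(struct vcpu *))
{
	assert(v && complete);
	v->ret = 0;
	if (needs_user) {
		v->complete_io = complete; /* run when userspace re-enters */
		return 0;                  /* 0 means "exit to userspace" */
	}
	return complete(v);                /* same callback, run right away */
}

/* What re-entry from userspace would do with the stored callback. */
static int resume_from_user(struct vcpu *v)
{
	int (*cb)(struct vcpu *) = v->complete_io;

	v->complete_io = NULL;
	return cb ? cb(v) : 1;
}
```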
|
|
KVM/riscv changes for 6.14
- Svvptc, Zabha, and Ziccrse extension support for Guest/VM
- Virtualize SBI system suspend extension for Guest/VM
- Trap related exit statistics as SBI PMU firmware counters for Guest/VM
|
|
KVM x86 misc changes for 6.14:
- Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
instead of just those where KVM needs to manage state and/or explicitly
enable the feature in hardware. Along the way, refactor the code to make
it easier to add features, and to make it more self-documenting how KVM
is handling each feature.
- Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
where KVM unintentionally puts the vCPU into infinite loops in some scenarios
(e.g. if emulation is triggered by the exit), and brings parity between VMX
and SVM.
- Add pending request and interrupt injection information to the kvm_exit and
kvm_entry tracepoints respectively.
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
loading guest/host PKRU, due to a refactoring of the kernel helpers that
didn't account for KVM's pre-checking of the need to do WRPKRU.
|
|
KVM VMX changes for 6.14:
- Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit
while L2 is active, which ultimately results in a hardware-accelerated L1
EOI effectively being lost.
- Honor event priority when emulating Posted Interrupt delivery during nested
VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the
interrupt.
- Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache
prior to every VM-Enter.
- Rework KVM's processing of the Page-Modification Logging buffer to reap
entries in the same order they were created, i.e. to mark gfns dirty in the
same order that hardware marked the page/PTE dirty.
- Misc cleanups.
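To illustrate the Page-Modification Logging item above: a sketch of reaping
the buffer oldest-first. It assumes a 512-entry buffer that hardware fills
from the highest slot downward, decrementing the index after each write;
PML_ENTRIES and reap_pml_in_order() are hypothetical names, not KVM's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PML_ENTRIES 512 /* assumed buffer size */

/*
 * Assumption: hardware writes an entry at pml_index and then decrements
 * the index, starting from PML_ENTRIES - 1, so the oldest entry sits in
 * the highest slot. Copies logged GPAs into out[] oldest-first and
 * returns how many were copied.
 */
static size_t reap_pml_in_order(const uint64_t *pml_buf, uint16_t pml_index,
				uint64_t *out)
{
	size_t n = 0;
	int tail, i;

	assert(pml_buf && out);
	if (pml_index == PML_ENTRIES - 1)
		return 0; /* nothing has been logged */

	/* An index that wrapped below zero means the buffer is full. */
	tail = (pml_index >= PML_ENTRIES) ? 0 : pml_index + 1;

	/* Walk from the first-written slot down to the last-written one. */
	for (i = PML_ENTRIES - 1; i >= tail; i--)
		out[n++] = pml_buf[i];
	return n;
}
```

Walking upward from the tail instead would visit the same entries, but in
reverse creation order.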
|
|
KVM SVM changes for 6.14:
- Macrofy the SEV=n version of the sev_xxx_guest() helpers so that the code is
optimized away when building with less than brilliant compilers.
- Remove a now-redundant TLB flush when guest CR4.PGE changes.
- Use str_enabled_disabled() to replace open coded strings.
|
|
KVM x86 MMU changes for 6.14:
- Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a
direct call to kvm_tdp_page_fault() when RETPOLINE is enabled.
|
|
HEAD
KVM vcpu_array fixes and cleanups for 6.14:
- Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug
where KVM would return a pointer to a vCPU prior to it being fully online,
and give kvm_for_each_vcpu() similar treatment to fix a similar flaw.
- Wait for a vCPU to come online prior to executing a vCPU ioctl to fix a
bug where userspace could coerce KVM into handling the ioctl on a vCPU that
isn't yet onlined.
- Gracefully handle xa_insert() failures even though such failures should be
impossible in practice.
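The first item above boils down to "an entry in the array is not enough,
the index must also be below the online count". A userspace sketch with C11
atomics, using hypothetical names (struct vm, vm_get_vcpu(),
vm_online_vcpu()) and a plain array in place of the kernel's xarray:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_VCPUS 8 /* illustrative */

struct vcpu { int id; };

struct vm {
	struct vcpu *vcpus[MAX_VCPUS]; /* may hold entries still being set up */
	atomic_int online_vcpus;       /* bumped only once a vCPU is ready */
};

/* Return the vCPU only if it has been published as online. */
static struct vcpu *vm_get_vcpu(struct vm *vm, int i)
{
	/* Acquire pairs with the release store in vm_online_vcpu(). */
	int online = atomic_load_explicit(&vm->online_vcpus,
					  memory_order_acquire);

	if (i < 0 || i >= online)
		return NULL;
	return vm->vcpus[i];
}

/* Publish a fully initialized vCPU. */
static void vm_online_vcpu(struct vm *vm, struct vcpu *vcpu)
{
	int idx = atomic_load_explicit(&vm->online_vcpus,
				       memory_order_relaxed);

	assert(idx < MAX_VCPUS);
	vm->vcpus[idx] = vcpu; /* in the array, but not yet online */
	atomic_store_explicit(&vm->online_vcpus, idx + 1,
			      memory_order_release);
}
```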
|
|
KVM kvm_set_memory_region() cleanups and hardening for 6.14:
- Add proper lockdep assertions when setting memory regions.
- Add a dedicated API for setting KVM-internal memory regions.
- Explicitly disallow all flags for KVM-internal memory regions.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.14
1. Clear LLBCTL if secondary mmu mapping changed.
2. Add hypercall service support for usermode VMM.
This is a really small changeset, because the Chinese New Year
(Spring Festival) is coming. Happy New Year!
|
|
Return RET_PF* (excluding RET_PF_EMULATE/RET_PF_CONTINUE/RET_PF_INVALID)
instead of 1 in kvm_mmu_page_fault().
The callers of kvm_mmu_page_fault() are KVM page fault handlers (i.e.,
npf_interception(), handle_ept_misconfig(), __vmx_handle_ept_violation(),
kvm_handle_page_fault()). They either check if the return value is > 0 (as
in npf_interception()) or pass it further to vcpu_run() to decide whether
to break out of the kernel loop and return to the user when r <= 0.
Therefore, returning any positive value is equivalent to returning 1.
Warn if r == RET_PF_CONTINUE (which should not be a valid value) to ensure
a positive return value.
This is a preparation to allow TDX's EPT violation handler to check the
RET_PF* value and retry internally for RET_PF_RETRY.
No functional changes are intended.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250113021138.18875-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
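Why any positive value behaves like 1 for the existing callers can be shown
with a small sketch. The enum below is illustrative: its names and exact
values are assumptions, and only "CONTINUE is 0, the others are positive"
matters for the argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values, not copied from the kernel headers. */
enum ret_pf {
	PF_CONTINUE = 0,
	PF_RETRY,
	PF_EMULATE,
	PF_INVALID,
	PF_FIXED,
	PF_SPURIOUS,
};

static_assert(PF_CONTINUE == 0, "CONTINUE must not look like a positive result");

/* Existing callers' convention: r <= 0 goes to userspace, r > 0 re-enters. */
static bool resume_guest(int r)
{
	return r > 0;
}

/* Hypothetical new caller that distinguishes RETRY from other results. */
static bool should_retry_internally(int r)
{
	return r == PF_RETRY;
}
```

Callers that only test the sign see no difference between 1 and PF_FIXED,
while a caller such as a TDX fault handler can inspect the precise value.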
|
|
Disallow all flags for KVM-internal memslots as all existing flags require
some amount of userspace interaction to have any meaning. In addition to
guarding against KVM goofs, explicitly disallowing dirty logging of KVM-
internal memslots will (hopefully) allow exempting KVM-internal memslots
from the KVM_MEM_MAX_NR_PAGES limit, which appears to exist purely because
the dirty bitmap operations use a 32-bit index.
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that there's no outer wrapper for __kvm_set_memory_region() and it's
static, drop its double-underscore prefix.
No functional change intended.
Cc: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a dedicated API for setting internal memslots, and have it explicitly
disallow setting userspace memslots. Setting a userspace memslot without
a direct command from userspace would result in all manner of issues.
No functional change intended.
Cc: Tao Su <tao1.su@linux.intel.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
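A sketch of the checks such a dedicated setter might perform, combining the
"internal slots only" rule with the "no flags" rule from the related
hardening patch. The slot layout constants, struct mem_region and
set_internal_region() are hypothetical and do not reflect KVM's actual
definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical layout: ids below USER_SLOTS belong to userspace. */
#define USER_SLOTS     32u
#define INTERNAL_SLOTS 3u

struct mem_region {
	uint32_t slot;
	uint32_t flags;
	uint64_t gpa;
	uint64_t size;
};

/* Accept only internal slot ids, and only with no flags set. */
static int set_internal_region(const struct mem_region *mem)
{
	assert(mem);
	if (mem->slot < USER_SLOTS ||
	    mem->slot >= USER_SLOTS + INTERNAL_SLOTS)
		return -EINVAL; /* not an internal slot */
	if (mem->flags)
		return -EINVAL; /* every flag needs userspace cooperation */

	/* ... the shared, lock-protected update path would run here ... */
	return 0;
}
```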
|
|
Add proper lockdep assertions in __kvm_set_memory_region() and
__x86_set_memory_region() instead of relying on comments.
Opportunistically delete __kvm_set_memory_region()'s entire function
comment as the API doesn't allocate memory or select a gfn, and the
"mostly for framebuffers" comment hasn't been true for a very long time.
Cc: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Open code kvm_set_memory_region() into its sole caller in preparation for
adding a dedicated API for setting internal memslots.
Opportunistically use the fancy new guard(mutex) to avoid a local 'r'
variable.
Cc: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
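For readers unfamiliar with scope-based guards, a userspace imitation can be
built on the GCC/Clang cleanup attribute. MUTEX_GUARD(), unlock_cleanup(),
slots_lock and set_region() are hypothetical names; this is not the
kernel's guard() implementation, only the same idea.

```c
#include <assert.h>
#include <pthread.h>

/* Runs automatically when a variable tagged with cleanup() leaves scope. */
static void unlock_cleanup(pthread_mutex_t **m)
{
	int rc = pthread_mutex_unlock(*m);

	assert(rc == 0);
	(void)rc;
}

/* Lock now, unlock at end of scope. Relies on a GCC/Clang extension. */
#define MUTEX_GUARD(m) \
	pthread_mutex_t *guard_ \
		__attribute__((cleanup(unlock_cleanup), unused)) = \
		(pthread_mutex_lock(m), (m))

static pthread_mutex_t slots_lock = PTHREAD_MUTEX_INITIALIZER;
static int nr_regions;

static int set_region(int valid)
{
	MUTEX_GUARD(&slots_lock);

	if (!valid)
		return -1; /* lock dropped here, no 'r' variable or goto */
	nr_regions++;
	return 0;          /* and here */
}
```

Because every return path releases the lock implicitly, the function can
return results directly instead of funnelling them through a local variable
and a single unlock site.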
|
|
Some VMMs provide a special hypercall service in usermode. KVM should not
handle the usermode hypercall service itself, so pass it to usermode and
let the usermode VMM handle it.
Here a new code, KVM_HCALL_CODE_USER_SERVICE, is added for the user-mode
hypercall service, and KVM makes all six registers visible to the usermode VMM.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
LLBCTL is a guest CSR that is separate from the host's: the ERET
instruction of a host exception clears the host LLBCTL CSR, and a guest
exception clears the guest LLBCTL CSR.
  VCPU0 atomic64_fetch_add_unless     VCPU1 atomic64_fetch_add_unless
    ll.d    %[p],  %[c]
    beq     %[p],  %[u], 1f
Here the secondary mmu mapping is changed and the host hpa page is replaced
with a new page. VCPU1 then executes the atomic instruction on the new page.
                                        ll.d    %[p],  %[c]
                                        beq     %[p],  %[u], 1f
                                        add.d   %[rc], %[p], %[a]
                                        sc.d    %[rc], %[c]
    add.d   %[rc], %[p], %[a]
    sc.d    %[rc], %[c]
LLBCTL is still set on VCPU0, which indicates that the memory has not been
modified by other VCPUs, so sc.d modifies the memory directly.
So clear WCLLB of the guest LLBCTL register when the mapping is changed.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc/IIO driver fixes from Greg KH:
"Here are a bunch of small IIO and interconnect and other driver fixes
to resolve reported issues. Included in here are:
- loads of iio driver fixes as a result of an audit of places where
uninitialized data would leak to userspace.
- other smaller, and normal, iio driver fixes.
- mhi driver fix
- interconnect driver fixes
- pci1xxxx driver fix
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-6.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (32 commits)
misc: microchip: pci1xxxx: Resolve return code mismatch during GPIO set config
misc: microchip: pci1xxxx: Resolve kernel panic during GPIO IRQ handling
interconnect: icc-clk: check return values of devm_kasprintf()
interconnect: qcom: icc-rpm: Set the count member before accessing the flex array
iio: adc: ti-ads1119: fix sample size in scan struct for triggered buffer
iio: temperature: tmp006: fix information leak in triggered buffer
iio: inkern: call iio_device_put() only on mapped devices
iio: adc: ad9467: Fix the "don't allow reading vref if not available" case
iio: adc: at91: call input_free_device() on allocated iio_dev
iio: adc: ad7173: fix using shared static info struct
iio: adc: ti-ads124s08: Use gpiod_set_value_cansleep()
iio: adc: ti-ads1119: fix information leak in triggered buffer
iio: pressure: zpa2326: fix information leak in triggered buffer
iio: adc: rockchip_saradc: fix information leak in triggered buffer
iio: imu: kmx61: fix information leak in triggered buffer
iio: light: vcnl4035: fix information leak in triggered buffer
iio: light: bh1745: fix information leak in triggered buffer
iio: adc: ti-ads8688: fix information leak in triggered buffer
iio: dummy: iio_simply_dummy_buffer: fix information leak in triggered buffer
iio: test: Fix GTS test config
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core and debugfs fixes from Greg KH:
"Here are some small driver core and debugfs fixes that resolve some
reported problems:
- debugfs runtime error reporting fixes
- topology cpumask race-condition fix
- MAINTAINERS file email update
All of these have been in linux-next this week with no reported
issues"
* tag 'driver-core-6.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
fs: debugfs: fix open proxy for unsafe files
MAINTAINERS: align Danilo's maintainer entries
topology: Keep the cpumask unchanged when printing cpumap
debugfs: fix missing mutex_destroy() in short_fops case
fs: debugfs: differentiate short fops with proxy ops
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver fixes from Greg KH:
"Here are some small staging driver fixes that resolve some reported
issues and have been in my tree for too long due to the holiday break.
They resolve the following issues:
- lots of gpib build-time fixes as reported by testers and 0-day
- gpib logical fixes
- mailmap fix
All of these have been in linux-next for a while, with no reported
issues other than the duplicated change"
* tag 'staging-6.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: gpib: mite: remove unused global functions
staging: gpib: refer to correct config symbol in tnt4882 Makefile
mailmap: update Bingwu Zhang's email address
staging: gpib: fix address space mixup
staging: gpib: use ioport_map
staging: gpib: fix pcmcia dependencies
staging: gpib: add module author and description fields
staging: gpib: fix Makefiles
staging: gpib: make global 'usec_diff' functions static
staging: gpib: Modify mismatched function name
staging: gpib: Add lower bound check for secondary address
staging: gpib: Fix erroneous removal of blank before newline
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull serial driver fixes from Greg KH:
"Here are three small serial driver fixes. They resolve some
reported issues:
- stm32 break control fix
- 8250 runtime pm usage counter fix
- imx driver locking fix
All have been in my tree and linux-next for three weeks now, with no
reported issues"
* tag 'tty-6.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: stm32: use port lock wrappers for break control
serial: imx: Use uart_port_lock_irq() instead of uart_port_lock()
tty: serial: 8250: Fix another runtime PM usage counter underflow
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes and new device ids for 6.13-rc7.
Included in here are:
- usb serial new device ids
- typec bugfixes for reported issues
- dwc3 driver fixes
- chipidea driver fixes
- gadget driver fixes
- other minor fixes for reported problems.
All of these have been in linux-next for a while, with no reported
issues"
* tag 'usb-6.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: serial: option: add Neoway N723-EA support
USB: serial: option: add MeiG Smart SRM815
USB: serial: cp210x: add Phoenix Contact UPS Device
usb: typec: fix pm usage counter imbalance in ucsi_ccg_sync_control()
usb-storage: Add max sectors quirk for Nokia 208
usb: gadget: midi2: Reverse-select at the right place
usb: gadget: f_fs: Remove WARN_ON in functionfs_bind
USB: core: Disable LPM only for non-suspended ports
usb: fix reference leak in usb_new_device()
usb: typec: tcpci: fix NULL pointer issue on shared irq case
usb: gadget: u_serial: Disable ep before setting port to null to fix the crash caused by port being null
usb: chipidea: ci_hdrc_imx: decrement device's refcount in .remove() and in the error path of .probe()
usb: typec: ucsi: Set orientation as none when connector is unplugged
usb: gadget: configfs: Ignore trailing LF for user strings to cdev
USB: usblp: return error when setting unsupported protocol
usb: gadget: f_uac2: Fix incorrect setting of bNumEndpoints
usb: typec: tcpm/tcpci_maxim: fix error code in max_contaminant_read_resistance_kohm()
usb: host: xhci-plat: set skip_phy_initialization if software node has XHCI_SKIP_PHY_INIT property
usb: dwc3-am62: Disable autosuspend during remove
usb: dwc3: gadget: fix writing NYET threshold
|
|
Pull kvm fixes from Paolo Bonzini:
"The largest part here is for KVM/PPC, where a NULL pointer dereference
was introduced in the 6.13 merge window and is now fixed.
There's some "holiday-induced lateness", as the s390 submaintainer put
it, but otherwise things look fine.
s390:
- fix a latent bug when the kernel is compiled in debug mode
- two small UCONTROL fixes and their selftests
arm64:
- always check page state in hyp_ack_unshare()
- align set_id_regs selftest with the fact that ASIDBITS field is RO
- various vPMU fixes for bugs that only affect nested virt
PPC e500:
- Fix a mostly impossible (but just wrong) case where IRQs were never
re-enabled
- Observe host permissions instead of mapping readonly host pages as
guest-writable. This fixes a NULL-pointer dereference in 6.13
- Replace brittle VMA-based attempts at building huge shadow TLB
entries with PTE lookups"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: e500: perform hugepage check after looking up the PFN
KVM: e500: map readonly host pages for read
KVM: e500: track host-writability of pages
KVM: e500: use shadow TLB entry as witness for writability
KVM: e500: always restore irqs
KVM: s390: selftests: Add has device attr check to uc_attr_mem_limit selftest
KVM: s390: selftests: Add ucontrol gis routing test
KVM: s390: Reject KVM_SET_GSI_ROUTING on ucontrol VMs
KVM: s390: selftests: Add ucontrol flic attr selftests
KVM: s390: Reject setting flic pfault attributes on ucontrol VMs
KVM: s390: vsie: fix virtual/physical address in unpin_scb()
KVM: arm64: Only apply PMCR_EL0.P to the guest range of counters
KVM: arm64: nv: Reload PMU events upon MDCR_EL2.HPME change
KVM: arm64: Use KVM_REQ_RELOAD_PMU to handle PMCR_EL0.E change
KVM: arm64: Add unified helper for reprogramming counters by mask
KVM: arm64: Always check the state from hyp_ack_unshare()
KVM: arm64: Fix set_id_regs selftest for ASIDBITS becoming unwritable
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
- Fix a #GP in the perf user callchain code caused by a race between
uprobe freeing the task and the bpf profiler unwinding the task's
user stack
* tag 'perf_urgent_for_v6.13_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
uprobes: Fix race in uprobe_free_utask
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Check whether shadow stack is active before using the ptrace regset
getter
- Remove a wrong BUG_ON in the early static call code which breaks Xen
PVH when booting as dom0
* tag 'x86_urgent_for_v6.13_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Ensure shadow stack is active before "getting" registers
x86/static-call: Remove early_boot_irqs_disabled check to fix Xen PVH dom0
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: three small bugfixes
Fix a latent bug when the kernel is compiled in debug mode.
Two small UCONTROL fixes and their selftests.
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 changes for 6.13, part #3
- Always check page state in hyp_ack_unshare()
- Align set_id_regs selftest with the fact that ASIDBITS field is RO
- Various vPMU fixes for bugs that only affect nested virt
|
|
The new __kvm_faultin_pfn() function is upset by the fact that e500
KVM ignores host page permissions - __kvm_faultin requires a "writable"
outgoing argument, but e500 KVM is passing NULL.
While a simple fix would be possible that simply allows writable to
be NULL, it is quite ugly to have e500 KVM ignore completely the host
permissions and map readonly host pages as guest-writable. Merge a more
complete fix and remove the VMA-based attempts at building huge shadow TLB
entries. Using a PTE lookup, similar to what is done for x86, is better
and works with remap_pfn_range() because it does not assume that VM_PFNMAP
areas are contiguous. Note that the same incorrect logic is there in
ARM's get_vma_page_shift() and RISC-V's kvm_riscv_gstage_ioremap().
Fortunately, for e500 most of the code is already there; it just has to
be changed to compute the range from find_linux_pte()'s output rather
than find_vma(). The new code works for both VM_PFNMAP and hugetlb
mappings, so the latter is removed.
Patches 2-5 were tested by the reporter, Christian Zigotzky. Since
the difference with v1 is minimal, I am going to send it to Linus
today.
|
|
e500 KVM tries to bypass __kvm_faultin_pfn() in order to map VM_PFNMAP
VMAs as huge pages. This is a Bad Idea because VM_PFNMAP VMAs could
become noncontiguous as a result of calls to remap_pfn_range().
Instead, use the already existing host PTE lookup to retrieve a
valid host-side mapping level after __kvm_faultin_pfn() has
returned. Then find the largest size that will satisfy the
guest's request while staying within a single host PTE.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
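The "largest size that fits" step can be illustrated with a simplified
calculation. This is a sketch under assumptions, not the e500 code: sizes
are expressed as plain shifts rather than e500 TSIZE encodings,
largest_map_shift() is a hypothetical helper, and alignment is checked by
requiring the guest and host frame numbers to agree in their low bits.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u /* assumed base page size of 4 KiB */

/*
 * Pick the largest mapping shift that is no bigger than either the guest
 * TLB entry or the host PTE, and for which gfn and pfn have the same
 * offset inside the large page (otherwise one aligned entry cannot map it).
 */
static unsigned int largest_map_shift(uint64_t gfn, uint64_t pfn,
				      unsigned int guest_shift,
				      unsigned int host_shift)
{
	unsigned int shift = guest_shift < host_shift ? guest_shift : host_shift;

	assert(shift >= PAGE_SHIFT && shift < 64);
	for (; shift > PAGE_SHIFT; shift--) {
		uint64_t mask = (1ULL << (shift - PAGE_SHIFT)) - 1;

		if ((gfn & mask) == (pfn & mask))
			break; /* offsets agree, this size works */
	}
	return shift; /* falls back to a single base page */
}
```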
|
|
The new __kvm_faultin_pfn() function is upset by the fact that e500 KVM
ignores host page permissions - __kvm_faultin requires a "writable"
outgoing argument, but e500 KVM is nonchalantly passing NULL.
If the host page permissions do not include writability, the shadow
TLB entry is forcibly mapped read-only.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add the possibility of marking a page so that the UW and SW bits are
force-cleared. This is stored in the private info so that it persists
across multiple calls to kvmppc_e500_setup_stlbe.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
kvmppc_e500_ref_setup is returning whether the guest TLB entry is writable,
which is then passed to kvm_release_faultin_page. This makes little sense
for two reasons: first, because the function sets up the private data for
the page and the return value feels like it has been bolted on the side;
second, because what really matters is whether the _shadow_ TLB entry is
writable. If it is not writable, the page can be released as non-dirty.
Shift from using tlbe_is_writable(gtlbe) to doing the same check on
the shadow TLB entry.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
If find_linux_pte fails, IRQs will not be restored. This is unlikely
to happen in practice since it would have been reported as hanging
hosts, but it should of course be fixed anyway.
Cc: stable@vger.kernel.org
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
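The bug class fixed here, an early exit that skips the restore, and the
usual single-exit remedy can be sketched as follows. irq_save(),
irq_restore(), lookup_pte() and map_page() are hypothetical stand-ins and
do not mirror the e500 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for local_irq_save()/local_irq_restore(). */
static bool irqs_enabled = true;

static bool irq_save(void)
{
	bool was = irqs_enabled;

	irqs_enabled = false;
	return was;
}

static void irq_restore(bool was)
{
	irqs_enabled = was;
}

/* Hypothetical lookup: returns NULL when no PTE is found. */
static const int *lookup_pte(const int *pte)
{
	return pte;
}

static int map_page(const int *pte_or_null)
{
	int ret = -1;
	bool flags = irq_save();
	const int *pte = lookup_pte(pte_or_null);

	assert(!irqs_enabled); /* the lookup runs with IRQs off */
	if (!pte)
		goto out;      /* the failure path must not skip the restore */
	ret = *pte;
out:
	irq_restore(flags);
	return ret;
}
```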
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fix from Masami Hiramatsu:
"Fix to free trace_kprobe objects at a failure path in
__trace_kprobe_create() function. This fixes a memory leak"
* tag 'probes-fixes-v6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/kprobes: Fix to free objects when failed to copy a symbol
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fix from Guenter Roeck:
"One patch to fix error handling in drivetemp driver"
* tag 'hwmon-for-v6.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (drivetemp) Fix driver producing garbage data when SCSI errors occur
|
|
Pull block fix from Jens Axboe:
"A single fix for a use-after-free in the BFQ IO scheduler"
* tag 'block-6.13-20250111' of git://git.kernel.dk/linux:
block, bfq: fix waker_bfqq UAF after bfq_split_bfqq()
|
|
Pull io_uring fixes from Jens Axboe:
- Fix for multishot timeout updates only using the updated value for
the first invocation, not subsequent ones
- Silence a false positive lockdep warning
- Fix the eventfd signaling and putting RCU logic
- Fix fault injected SQPOLL setup not clearing the task pointer in the
error path
- Fix local task_work looking at the SQPOLL thread rather than just
signaling the safe variant. Again one of those theoretical issues,
which should be closed up nonetheless.
* tag 'io_uring-6.13-20250111' of git://git.kernel.dk/linux:
io_uring: don't touch sqd->thread off tw add
io_uring/sqpoll: zero sqd->thread on tctx errors
io_uring/eventfd: ensure io_eventfd_signal() defers another RCU period
io_uring: silence false positive warnings
io_uring/timeout: fix multishot updates
|
|
Pull smb client fix from Steve French:
- fix unneeded session setup retry due to stale password e.g. for DFS
automounts
* tag '6.13-rc6-SMB3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: sync the root session and superblock context passwords before automounting
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC fixes from Arnd Bergmann:
"Over the Christmas break a couple of devicetree fixes came in for
Rockchips, Qualcomm and NXP/i.MX. These add some missing board
specific properties and address build time warnings.
The USB/OTG support on X1 Elite regressed, so two earlier DT changes
get reverted for now.
Aside from the devicetree fixes, there is one build fix for the stm32
firewall driver, and a defconfig change to enable SPDIF support for
i.MX"
* tag 'soc-fixes-6.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
firewall: remove misplaced semicolon from stm32_firewall_get_firewall
arm64: dts: rockchip: add hevc power domain clock to rk3328
arm64: dts: rockchip: Fix the SD card detection on NanoPi R6C/R6S
arm64: dts: qcom: sa8775p: fix the secure device bootup issue
Revert "arm64: dts: qcom: x1e80100: enable OTG on USB-C controllers"
Revert "arm64: dts: qcom: x1e80100-crd: enable otg on usb ports"
arm64: dts: qcom: x1e80100: Fix up BAR space size for PCIe6a
Revert "arm64: dts: qcom: x1e78100-t14s: enable otg on usb-c ports"
ARM: dts: imxrt1050: Fix clocks for mmc
ARM: imx_v6_v7_defconfig: enable SND_SOC_SPDIF
arm64: dts: imx95: correct the address length of netcmix_blk_ctrl
arm64: dts: imx8-ss-audio: add fallback compatible string fsl,imx6ull-esai for esai
arm64: dts: rockchip: rename rfkill label for Radxa ROCK 5B
arm64: dts: rockchip: add reset-names for combphy on rk3568
arm64: dts: qcom: sa8775p: Fix the size of 'addr_space' regions
|
|
Maddy is taking over the day-to-day maintenance of powerpc. I will still
be around to help, and as a backup.
Re-order the main POWERPC list to put Maddy first to reflect that.
KVM/powerpc patches will be handled by Maddy via the powerpc tree with
review from Nick, so replace myself with Maddy there.
Remove myself from BPF, leaving Hari & Christophe as maintainers.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
automounting
In some cases, when password2 becomes the working password, the
client swaps the two password fields in the root session struct, but
not in the smb3_fs_context struct in cifs_sb. DFS automounts inherit
fs context from their parent mounts. Therefore, they might end up
getting the passwords in the stale order.
The automount should succeed, because the mount function will end up
retrying with the actual password anyway. But to reduce these
unnecessary session setup retries for automounts, we can sync the
parent context's passwords with the root session's passwords before
duplicating it to the child's fs context.
Cc: stable@vger.kernel.org
Signed-off-by: Meetakshi Setiya <msetiya@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix corner case bug where ops.dispatch() couldn't extend the
execution of the current task if SCX_OPS_ENQ_LAST is set.
- Fix ops.cpu_release() not being called when a SCX task is preempted
by a higher priority sched class task.
- Fix builtin idle mask being incorrectly left as busy after an idle CPU
is picked and kicked.
- scx_ops_bypass() was unnecessarily using rq_lock() which comes with
rq pinning related sanity checks which could trigger spuriously.
Switch to raw_spin_rq_lock().
* tag 'sched_ext-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: idle: Refresh idle masks during idle-to-idle transitions
sched_ext: switch class when preempted by higher priority scheduler
sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass()
sched_ext: keep running prev when prev->scx.slice != 0
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"Cpuset fixes:
- Fix isolated CPUs leaking into sched domains
- Remove now unnecessary kernfs active break which can trigger a
warning
- Comment updates"
* tag 'cgroup-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: remove kernfs active break
cgroup/cpuset: Prevent leakage of isolated CPUs into sched domains
cgroup/cpuset: Remove stale text
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
- Add a WARN_ON_ONCE() to queue_delayed_work_on() for the case where the
target CPU is offline, as such work items won't get executed until the
CPU comes back online
* tag 'wq-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: warn if delayed_work is queued to an offlined cpu.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fix from Rafael Wysocki:
"Fix an OF node leak in the code parsing thermal zone DT properties
(Joe Hattori)"
* tag 'thermal-6.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: of: fix OF node leak in of_thermal_zone_find()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"Add two more ACPI IRQ override quirks and update the code using them
to avoid unnecessary overhead (Hans de Goede)"
* tag 'acpi-6.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: resource: acpi_dev_irq_override(): Check DMI match last
ACPI: resource: Add TongFang GM5HG0A to irq1_edge_low_force_override[]
ACPI: resource: Add Asus Vivobook X1504VAP to irq1_level_low_skip_override[]
|
|
With the consolidation of put_prev_task/set_next_task(), see
commit 436f3eed5c69 ("sched: Combine the last put_prev_task() and the
first set_next_task()"), we are now skipping the transition between
these two functions when the previous and the next tasks are the same.
As a result, the scx idle state of a CPU is updated only when
transitioning to or from the idle thread. While this is generally
correct, it can lead to uneven and inefficient core utilization in
certain scenarios [1].
A typical scenario involves proactive wake-ups: scx_bpf_pick_idle_cpu()
selects and marks an idle CPU as busy, followed by a wake-up via
scx_bpf_kick_cpu(), without dispatching any tasks. In this case, the CPU
continues running the idle thread, returns to idle, but remains marked
as busy, preventing it from being selected again as an idle CPU (until a
task eventually runs on it and releases the CPU).
For example, running a workload that uses 20% of each CPU, combined with
an scx scheduler using proactive wake-ups, results in the following core
utilization:
CPU 0: 25.7%
CPU 1: 29.3%
CPU 2: 26.5%
CPU 3: 25.5%
CPU 4: 0.0%
CPU 5: 25.5%
CPU 6: 0.0%
CPU 7: 10.5%
To address this, refresh the idle state also in pick_task_idle(), during
idle-to-idle transitions, but only trigger ops.update_idle() on actual
state changes to prevent unnecessary updates to the scx scheduler and
maintain balanced state transitions.
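The idea can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel implementation: NR_CPUS, idle_mask, last_notified_idle, notify_update_idle(), refresh_idle_state() and pick_idle_cpu() are all hypothetical names standing in for the builtin idle mask, ops.update_idle() and scx_bpf_pick_idle_cpu(). The mask is refreshed on every call, while the notification fires only when the state actually changes.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8	/* hypothetical CPU count, for illustration only */

/* Toy stand-ins for the builtin idle mask and per-CPU notified state. */
static bool idle_mask[NR_CPUS];
static bool last_notified_idle[NR_CPUS];
static int update_idle_calls;

/* Stand-in for ops.update_idle(): only counts invocations. */
static void notify_update_idle(int cpu, bool idle)
{
	(void)cpu;
	(void)idle;
	update_idle_calls++;
}

/*
 * Always refresh the idle mask, but notify the scheduler only when the
 * idle state differs from the last state it was told about.
 */
static void refresh_idle_state(int cpu, bool idle)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	idle_mask[cpu] = idle;
	if (last_notified_idle[cpu] != idle) {
		last_notified_idle[cpu] = idle;
		notify_update_idle(cpu, idle);
	}
}

/* Models picking an idle CPU: the chosen CPU is marked busy in the mask. */
static int pick_idle_cpu(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (idle_mask[cpu]) {
			idle_mask[cpu] = false;
			return cpu;
		}
	}
	return -1;
}
```

In this model, a CPU that is picked and kicked without receiving a task calls refresh_idle_state(cpu, true) again on its idle-to-idle transition: the mask bit is restored so the CPU can be picked again, and no extra notification is generated.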
With this change in place, the core utilization in the previous example
becomes the following:
CPU 0: 18.8%
CPU 1: 19.4%
CPU 2: 18.0%
CPU 3: 18.7%
CPU 4: 19.3%
CPU 5: 18.9%
CPU 6: 18.7%
CPU 7: 19.3%
[1] https://github.com/sched-ext/scx/pull/1139
Fixes: 7c65ae81ea86 ("sched_ext: Don't call put_prev_task_scx() before picking the next task")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|