summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-05-07KVM: x86: Explicitly zero kvm_caps during vendor module loadSean Christopherson
Zero out all of kvm_caps when loading a new vendor module to ensure that KVM can't inadvertently rely on global initialization of a field, and add a comment above the definition of kvm_caps to call out that all fields needs to be explicitly computed during vendor module load. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Fully re-initialize supported_mce_cap on vendor module loadSean Christopherson
Effectively reset supported_mce_cap on vendor module load to ensure that capabilities aren't unintentionally preserved across module reload, e.g. if kvm-intel.ko added a module param to control LMCE support, or if someone somehow managed to load a vendor module that doesn't support LMCE after loading and unloading kvm-intel.ko. Practically speaking, this bug is a non-issue as kvm-intel.ko doesn't have a module param for LMCE, and there is no system in the world that supports both kvm-intel.ko and kvm-amd.ko. Fixes: c45dcc71b794 ("KVM: VMX: enable guest access to LMCE related MSRs") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Fully re-initialize supported_vm_types on vendor module loadSean Christopherson
Recompute the entire set of supported VM types when a vendor module is loaded, as preserving supported_vm_types across vendor module unload and reload can result in VM types being incorrectly treated as supported. E.g. if a vendor module is loaded with TDP enabled, unloaded, and then reloaded with TDP disabled, KVM_X86_SW_PROTECTED_VM will be incorrectly retained. Ditto for SEV_VM and SEV_ES_VM and their respective module params in kvm-amd.ko. Fixes: 2a955c4db1dd ("KVM: x86: Add supported_vm_types to kvm_caps") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07Merge tag 'kvm-riscv-6.10-1' of https://github.com/kvm-riscv/linux into HEADPaolo Bonzini
KVM/riscv changes for 6.10 - Support guest breakpoints using ebreak - Introduce per-VCPU mp_state_lock and reset_cntx_lock - Virtualize SBI PMU snapshot and counter overflow interrupts - New selftests for SBI PMU and Guest ebreak
2024-05-07KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfnsSean Christopherson
WARN if __kvm_faultin_pfn() generates a "no slot" pfn, and gracefully handle the unexpected behavior instead of continuing on with dangerous state, e.g. tdp_mmu_map_handle_target_level() _only_ checks fault->slot, and so could install a bogus PFN into the guest. The existing code is functionally ok, because kvm_faultin_pfn() pre-checks all of the cases that result in KVM_PFN_NOSLOT, but it is unnecessarily unsafe as it relies on __gfn_to_pfn_memslot() getting the _exact_ same memslot, i.e. not a re-retrieved pointer with KVM_MEMSLOT_INVALID set. And checking only fault->slot would fall apart if KVM ever added a flag or condition that forced emulation, similar to how KVM handles writes to read-only memslots. Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-17-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error valuesSean Christopherson
Explicitly set "pfn" and "hva" to error values in kvm_mmu_do_page_fault() to harden KVM against using "uninitialized" values. In quotes because the fields are actually zero-initialized, and zero is a legal value for both page frame numbers and virtual addresses. E.g. failure to set "pfn" prior to creating an SPTE could result in KVM pointing at physical address '0', which is far less desirable than KVM generating a SPTE with reserved PA bits set and thus effectively killing the VM. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Set kvm_page_fault.hva to KVM_HVA_ERR_BAD for "no slot" faultsSean Christopherson
Explicitly set fault->hva to KVM_HVA_ERR_BAD when handling a "no slot" fault to ensure that KVM doesn't use a bogus virtual address, e.g. if there *was* a slot but it's unusable (APIC access page), or if there really was no slot, in which case fault->hva will be '0' (which is a legal address for x86). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Handle no-slot faults at the beginning of kvm_faultin_pfn()Sean Christopherson
Handle the "no memslot" case at the beginning of kvm_faultin_pfn(), just after the private versus shared check, so that there's no need to repeatedly query whether or not a slot exists. This also makes it more obvious that, except for private vs. shared attributes, the process of faulting in a pfn simply doesn't apply to gfns without a slot. Opportunistically stuff @fault's metadata in kvm_handle_noslot_fault() so that it doesn't need to be duplicated in all paths that invoke kvm_handle_noslot_fault(), and to minimize the probability of not stuffing the right fields. Leave the existing handle behind, but convert it to a WARN, to guard against __kvm_faultin_pfn() unexpectedly nullifying fault->slot. Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin_pfn()Sean Christopherson
Move the checks related to the validity of an access to a memslot from the inner __kvm_faultin_pfn() to its sole caller, kvm_faultin_pfn(). This allows emulating accesses to the APIC access page, which don't need to resolve a pfn, even if there is a relevant in-progress mmu_notifier invalidation. Ditto for accesses to KVM internal memslots from L2, which KVM also treats as emulated MMIO. More importantly, this will allow for future cleanup by having the "no memslot" case bail from kvm_faultin_pfn() very early on. Go to rather extreme and gross lengths to make the change a glorified nop, e.g. call into __kvm_faultin_pfn() even when there is no slot, as the related code is very subtle. E.g. fault->slot can be nullified if it points at the APIC access page, some flows in KVM x86 expect fault->pfn to be KVM_PFN_NOSLOT, while others check only fault->slot, etc. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Explicitly disallow private accesses to emulated MMIOSean Christopherson
Explicitly detect and disallow private accesses to emulated MMIO in kvm_handle_noslot_fault() instead of relying on kvm_faultin_pfn_private() to perform the check. This will allow the page fault path to go straight to kvm_handle_noslot_fault() without bouncing through __kvm_faultin_pfn(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240228024147.41573-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Don't force emulation of L2 accesses to non-APIC internal slotsSean Christopherson
Allow mapping KVM's internal memslots used for EPT without unrestricted guest into L2, i.e. allow mapping the hidden TSS and the identity mapped page tables into L2. Unlike the APIC access page, there is no correctness issue with letting L2 access the "hidden" memory. Allowing these memslots to be mapped into L2 fixes a largely theoretical bug where KVM could incorrectly emulate subsequent _L1_ accesses as MMIO, and also ensures consistent KVM behavior for L2. If KVM is using TDP, but L1 is using shadow paging for L2, then routing through kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO, and create an MMIO SPTE. Creating an MMIO SPTE is ok, but only because kvm_mmu_page_role.guest_mode ensure KVM uses different roots for L1 vs. L2. But vcpu->arch.mmio_gfn will remain valid, and could cause KVM to incorrectly treat an L1 access to the hidden TSS or identity mapped page tables as MMIO. Furthermore, forcing L2 accesses to be treated as "no slot" faults doesn't actually prevent exposing KVM's internal memslots to L2, it simply forces KVM to emulate the access. In most cases, that will trigger MMIO, amusingly due to filling vcpu->arch.mmio_gfn, but also because vcpu_is_mmio_gpa() unconditionally treats APIC accesses as MMIO, i.e. APIC accesses are ok. But the hidden TSS and identity mapped page tables could go either way (MMIO or access the private memslot's backing memory). Alternatively, the inconsistent emulator behavior could be addressed by forcing MMIO emulation for L2 access to all internal memslots, not just to the APIC. But that's arguably less correct than letting L2 access the hidden TSS and identity mapped page tables, not to mention that it's *extremely* unlikely anyone cares what KVM does in this case. From L1's perspective there is R/W memory at those memslots, the memory just happens to be initialized with non-zero data. Making the memory disappear when it is accessed by L2 is far more magical and arbitrary than the memory existing in the first place. The APIC access page is special because KVM _must_ emulate the access to do the right thing (emulate an APIC access instead of reading/writing the APIC access page). And despite what commit 3a2936dedd20 ("kvm: mmu: Don't expose private memslots to L2") said, it's not just necessary when L1 is accelerating L2's virtual APIC, it's just as important (likely *more* imporant for correctness when L1 is passing through its own APIC to L2. Fixes: 3a2936dedd20 ("kvm: mmu: Don't expose private memslots to L2") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Move private vs. shared check above slot validity checksSean Christopherson
Prioritize private vs. shared gfn attribute checks above slot validity checks to ensure a consistent userspace ABI. E.g. as is, KVM will exit to userspace if there is no memslot, but emulate accesses to the APIC access page even if the attributes mismatch. Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory") Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Cc: Chao Peng <chao.p.peng@linux.intel.com> Cc: Fuad Tabba <tabba@google.com> Cc: Michael Roth <michael.roth@amd.com> Cc: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page faultsSean Christopherson
WARN and skip the emulated MMIO fastpath if a private, reserved page fault is encountered, as private+reserved should be an impossible combination (KVM should never create an MMIO SPTE for a private access). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240228024147.41573-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: check for invalid async page faults involving private memoryPaolo Bonzini
Right now the error code is not used when an async page fault is completed. This is not a problem in the current code, but it is untidy. For protected VMs, we will also need to check that the page attributes match the current state of the page, because asynchronous page faults can only occur on shared pages (private pages go through kvm_faultin_pfn_private() instead of __gfn_to_pfn_memslot()). Start by piping the error code from kvm_arch_setup_async_pf() to kvm_arch_async_page_ready() via the architecture-specific async page fault data. For now, it can be used to assert that there are no async page faults on private memory. Extracted from a patch by Isaku Yamahata. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Use synthetic page fault error code to indicate private faultsSean Christopherson
Add and use a synthetic, KVM-defined page fault error code to indicate whether a fault is to private vs. shared memory. TDX and SNP have different mechanisms for reporting private vs. shared, and KVM's software-protected VMs have no mechanism at all. Usurp an error code flag to avoid having to plumb another parameter to kvm_mmu_page_fault() and friends. Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it for TDX and software-protected VMs as appropriate, but that would require *clearing* the flag for SEV and SEV-ES VMs, which support encrypted memory at the hardware layer, but don't utilize private memory at the KVM layer. Opportunistically add a comment to call out that the logic for software- protected VMs is (and was before this commit) broken for nested MMUs, i.e. for nested TDP, as the GPA is an L2 GPA. Punt on trying to play nice with nested MMUs as there is a _lot_ of functionality that simply doesn't work for software-protected VMs, e.g. all of the paths where KVM accesses guest memory need to be updated to be aware of private vs. shared memory. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20240228024147.41573-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zeroSean Christopherson
WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF, as the error code for #PF is limited to 32 bits (and in practice, 16 bits on Intel CPUS). This behavior is architectural, is part of KVM's ABI (see kvm_vcpu_events.error_code), and is explicitly documented as being preserved for intecerpted #PF in both the APM: The error code saved in EXITINFO1 is the same as would be pushed onto the stack by a non-intercepted #PF exception in protected mode. and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a 32-bit field. Simply drop the upper bits if hardware provides garbage, as spurious information should do no harm (though in all likelihood hardware is buggy and the kernel is doomed). Handling all upper 32 bits in the #PF path will allow moving the sanity check on synthetic checks from kvm_mmu_page_fault() to npf_interception(), which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's PFERR_GUEST_ENC_MASK without running afoul of the sanity check. Note, this is also why Intel uses bit 15 for SGX (highest bit on Intel CPUs) and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest bit minimizes the probability of a collision with the "other" vendor, without needing to plumb more bits through microcode. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Pass full 64-bit error code when handling page faultsIsaku Yamahata
Plumb the full 64-bit error code throughout the page fault handling code so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK will be used to determine whether or not a fault is private vs. shared. Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change the behavior of permission_fault() when invoked in the page fault path, as KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault(). Continue passing '0' from the async #PF worker, as guest_memfd and thus private memory doesn't support async page faults. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [mdr: drop references/changes on rebase, update commit message] Signed-off-by: Michael Roth <michael.roth@amd.com> [sean: drop truncation in call to FNAME(walk_addr)(), rewrite changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240228024147.41573-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handlerSean Christopherson
Move the sanity check that hardware never sets bits that collide with KVM- define synthetic bits from kvm_mmu_page_fault() to npf_interception(), i.e. make the sanity check #NPF specific. The legacy #PF path already WARNs if _any_ of bits 63:32 are set, and the error code that comes from VMX's EPT Violatation and Misconfig is 100% synthesized (KVM morphs VMX's EXIT_QUALIFICATION into error code flags). Add a compile-time assert in the legacy #PF handler to make sure that KVM- define flags are covered by its existing sanity check on the upper bits. Opportunistically add a description of PFERR_IMPLICIT_ACCESS, since we are removing the comment that defined it. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20240228024147.41573-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Define more SEV+ page fault error bits/flags for #NPFSean Christopherson
Define more #NPF error code flags that are relevant to SEV+ (mostly SNP) guests, as specified by the APM: * Bit 31 (RMP): Set to 1 if the fault was caused due to an RMP check or a VMPL check failure, 0 otherwise. * Bit 34 (ENC): Set to 1 if the guest’s effective C-bit was 1, 0 otherwise. * Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between PVALIDATE or RMPADJUST and the RMP, 0 otherwise. * Bit 36 (VMPL): Set to 1 if the fault was caused by a VMPL permission check failure, 0 otherwise. Note, the APM is *extremely* misleading, and strongly implies that the above flags can _only_ be set for #NPF exits from SNP guests. That is a lie, as bit 34 (C-bit=1, i.e. was encrypted) can be set when running _any_ flavor of SEV guest on SNP capable hardware. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240228024147.41573-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Remove separate "bit" defines for page fault error code masksSean Christopherson
Open code the bit number directly in the PFERR_* masks and drop the intermediate PFERR_*_BIT defines, as having to bounce through two macros just to see which flag corresponds to which bit is quite annoying, as is having to define two macros just to add recognition of a new flag. Use ternary operator to derive the bit in permission_fault(), the one function that actually needs the bit number as part of clever shifting to avoid conditional branches. Generally the compiler is able to turn it into a conditional move, and if not it's not really a big deal. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240228024147.41573-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits emulationSean Christopherson
Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault triggers emulation of any kind, as KVM doesn't currently support emulating access to guest private memory. Practically speaking, private faults and emulation are already mutually exclusive, but there are many flow that can result in KVM returning RET_PF_EMULATE, and adding one last check to harden against weird, unexpected combinations and/or KVM bugs is inexpensive. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240228024147.41573-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-06LoongArch: KVM: Add mmio trace events supportBibo Mao
Add mmio trace events support, currently generic mmio events KVM_TRACE_MMIO_WRITE/xxx_READ/xx_READ_UNSATISFIED are added here. Also vcpu id field is added for all kvm trace events, since perf KVM tool parses vcpu id information for kvm entry event. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-06LoongArch: KVM: Add software breakpoint supportBibo Mao
When VM runs in kvm mode, system will not exit to host mode when executing a general software breakpoint instruction such as INSN_BREAK, trap exception happens in guest mode rather than host mode. In order to debug guest kernel on host side, one mechanism should be used to let VM exit to host mode. Here a hypercall instruction with a special code is used for software breakpoint usage. VM exits to host mode and kvm hypervisor identifies the special hypercall code and sets exit_reason with KVM_EXIT_DEBUG. And then let qemu handle it. Idea comes from ppc kvm, one api KVM_REG_LOONGARCH_DEBUG_INST is added to get the hypercall code. VMM needs get sw breakpoint instruction with this api and set the corresponding sw break point for guest kernel. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-06LoongArch: KVM: Add PV IPI support on guest sideBibo Mao
PARAVIRT config option and PV IPI is added for the guest side, function pv_ipi_init() is used to add IPI sending and IPI receiving hooks. This function firstly checks whether system runs in VM mode, and if kernel runs in VM mode, it will call function kvm_para_available() to detect the current hypervirsor type (now only KVM type detection is supported). The paravirt functions can work only if current hypervisor type is KVM, since there is only KVM supported on LoongArch now. PV IPI uses virtual IPI sender and virtual IPI receiver functions. With virtual IPI sender, IPI message is stored in memory rather than emulated HW. IPI multicast is also supported, and 128 vcpus can received IPIs at the same time like X86 KVM method. Hypercall method is used for IPI sending. With virtual IPI receiver, HW SWI0 is used rather than real IPI HW. Since VCPU has separate HW SWI0 like HW timer, there is no trap in IPI interrupt acknowledge. Since IPI message is stored in memory, there is no trap in getting IPI message. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-06LoongArch: KVM: Add PV IPI support on host sideBibo Mao
On LoongArch system, IPI hw uses iocsr registers. There are one iocsr register access on IPI sending, and two iocsr access on IPI receiving for the IPI interrupt handler. In VM mode all iocsr accessing will cause VM to trap into hypervisor. So with one IPI hw notification there will be three times of trap. In this patch PV IPI is added for VM, hypercall instruction is used for IPI sender, and hypervisor will inject an SWI to the destination vcpu. During the SWI interrupt handler, only CSR.ESTAT register is written to clear irq. CSR.ESTAT register access will not trap into hypervisor, so with PV IPI supported, there is one trap with IPI sender, and no trap with IPI receiver, there is only one trap with IPI notification. Also this patch adds IPI multicast support, the method is similar with x86. With IPI multicast support, IPI notification can be sent to at most 128 vcpus at one time. It greatly reduces the times of trapping into hypervisor. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-06LoongArch: KVM: Add vcpu mapping from physical cpuidBibo Mao
Physical CPUID is used for interrupt routing for irqchips such as ipi, msgint and eiointc interrupt controllers. Physical CPUID is stored at the CSR register LOONGARCH_CSR_CPUID, it can not be changed once vcpu is created and the physical CPUIDs of two vcpus cannot be the same. Different irqchips have different size declaration about physical CPUID, the max CPUID value for CSR LOONGARCH_CSR_CPUID on Loongson-3A5000 is 512, the max CPUID supported by IPI hardware is 1024, while for eiointc irqchip is 256, and for msgint irqchip is 65536. The smallest value from all interrupt controllers is selected now, and the max cpuid size is defines as 256 by KVM which comes from the eiointc irqchip. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-06LoongArch: KVM: Add cpucfg area for kvm hypervisorBibo Mao
Instruction cpucfg can be used to get processor features. And there is a trap exception when it is executed in VM mode, and also it can be used to provide cpu features to VM. On real hardware cpucfg area 0 - 20 is used by now. Here one specified area 0x40000000 -- 0x400000ff is used for KVM hypervisor to provide PV features, and the area can be extended for other hypervisors in future. This area will never be used for real HW, it is only used by software. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-06LoongArch: KVM: Add hypercall instruction emulationBibo Mao
On LoongArch system, there is a hypercall instruction special for virtualization. When system executes this instruction on host side, there is an illegal instruction exception reported, however it will trap into host when it is executed in VM mode. When hypercall is emulated, A0 register is set with value KVM_HCALL_INVALID_CODE, rather than inject EXCCODE_INE invalid instruction exception. So VM can continue to executing the next code. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-06LoongArch/smp: Refine some ipi functions on LoongArch platformBibo Mao
Refine the ipi handling on LoongArch platform, there are three modifications: 1. Add generic function get_percpu_irq(), replacing some percpu irq functions such as get_ipi_irq()/get_pmc_irq()/get_timer_irq() with get_percpu_irq(). 2. Change definition about parameter action called by function loongson_send_ipi_single() and loongson_send_ipi_mask(), and it is defined as decimal encoding format at ipi sender side. Normal decimal encoding is used rather than binary bitmap encoding for ipi action, ipi hw sender uses decimal encoding code, and ipi receiver will get binary bitmap encoding, the ipi hw will convert it into bitmap in ipi message buffer. 3. Add a structure smp_ops on LoongArch platform so that pv ipi can be used later. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-05-05Linux 6.9-rc7v6.9-rc7Linus Torvalds
2024-05-05epoll: be better about file lifetimesLinus Torvalds
epoll can call out to vfs_poll() with a file pointer that may race with the last 'fput()'. That would make f_count go down to zero, and while the ep->mtx locking means that the resulting file pointer tear-down will be blocked until the poll returns, it means that f_count is already dead, and any use of it won't actually get a reference to the file any more: it's dead regardless. Make sure we have a valid ref on the file pointer before we call down to vfs_poll() from the epoll routines. Link: https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/ Reported-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-05-05Merge tag 'edac_urgent_for_v6.9_rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC fixes from Borislav Petkov: - Fix error logging and check user-supplied data when injecting an error in the versal EDAC driver * tag 'edac_urgent_for_v6.9_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/versal: Do not log total error counts EDAC/versal: Check user-supplied data before injecting an error EDAC/versal: Do not register for NOC errors
2024-05-05Merge tag 'powerpc-6.9-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix incorrect delay handling in the plpks (keystore) code - Fix a panic when an LPAR boots with a frozen PE Thanks to Andrew Donnellan, Gaurav Batra, Nageswara R Sastry, and Nayna Jain. * tag 'powerpc-6.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/pseries/iommu: LPAR panics during boot up with a frozen PE powerpc/pseries: make max polling consistent for longer H_CALLs
2024-05-05Merge tag 'x86-urgent-2024-05-05' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: - Remove the broken vsyscall emulation code from the page fault code - Fix kexec crash triggered by certain SEV RMP table layouts - Fix unchecked MSR access error when disabling the x2APIC via iommu=off * tag 'x86-urgent-2024-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Remove broken vsyscall emulation code from the page fault code x86/apic: Don't access the APIC when disabling x2APIC x86/sev: Add callback to apply RMP table fixups for kexec x86/e820: Add a new e820 table update helper
2024-05-05Merge tag 'irq-urgent-2024-05-05' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Ingo Molnar: "Fix suspicious RCU usage in __do_softirq()" * tag 'irq-urgent-2024-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: softirq: Fix suspicious RCU usage in __do_softirq()
2024-05-05Merge tag 'char-misc-6.9-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here are some small char/misc/other driver fixes and new device ids for 6.9-rc7 that resolve some reported problems. Included in here are: - iio driver fixes - mei driver fix and new device ids - dyndbg bugfix - pvpanic-pci driver bugfix - slimbus driver bugfix - fpga new device id All have been in linux-next with no reported problems" * tag 'char-misc-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: slimbus: qcom-ngd-ctrl: Add timeout for wait operation dyndbg: fix old BUG_ON in >control parser misc/pvpanic-pci: register attributes via pci_driver fpga: dfl-pci: add PCI subdevice ID for Intel D5005 card mei: me: add lunar lake point M DID mei: pxp: match against PCI_CLASS_DISPLAY_OTHER iio:imu: adis16475: Fix sync mode setting iio: accel: mxc4005: Reset chip on probe() and resume() iio: accel: mxc4005: Interrupt handling fixes dt-bindings: iio: health: maxim,max30102: fix compatible check iio: pressure: Fixes SPI support for BMP3xx devices iio: pressure: Fixes BME280 SPI driver data
2024-05-05Merge tag 'usb-6.9-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB driver fixes from Greg KH: "Here are some small USB driver fixes for reported problems for 6.9-rc7. Included in here are: - usb core fixes for found issues - typec driver fixes for reported problems - usb gadget driver fixes for reported problems - xhci build fixes - dwc3 driver fixes for reported issues All of these have been in linux-next this past week with no reported problems" * tag 'usb-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: usb: typec: tcpm: Check for port partner validity before consuming it usb: typec: tcpm: enforce ready state when queueing alt mode vdm usb: typec: tcpm: unregister existing source caps before re-registration usb: typec: tcpm: clear pd_event queue in PORT_RESET usb: typec: tcpm: queue correct sop type in tcpm_queue_vdm_unlocked usb: Fix regression caused by invalid ep0 maxpacket in virtual SuperSpeed device usb: ohci: Prevent missed ohci interrupts usb: typec: qcom-pmic: fix pdphy start() error handling usb: typec: qcom-pmic: fix use-after-free on late probe errors usb: gadget: f_fs: Fix a race condition when processing setup packets. USB: core: Fix access violation during port device removal usb: dwc3: core: Prevent phy suspend during init usb: xhci-plat: Don't include xhci.h usb: gadget: uvc: use correct buffer size when parsing configfs lists usb: gadget: composite: fix OS descriptors w_value logic usb: gadget: f_fs: Fix race between aio_cancel() and AIO request complete
2024-05-05Merge tag 'input-for-v6.9-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input fixes from Dmitry Torokhov: - a new ID for ASUS ROG RAIKIRI controllers added to xpad driver - amimouse driver structure annotated with __refdata to prevent section mismatch warnings. * tag 'input-for-v6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: amimouse - mark driver struct with __refdata to prevent section mismatch Input: xpad - add support for ASUS ROG RAIKIRI
2024-05-05Merge tag 'probes-fixes-v6.9-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fix from Masami Hiramatsu: - probe-events: Fix memory leak in parsing probe argument. There is a memory leak (forget to free an allocated buffer) in a memory allocation failure path. Fix it to jump to the correct error handling code. * tag 'probes-fixes-v6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body()
2024-05-05Merge tag 'trace-v6.9-rc6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing and tracefs fixes from Steven Rostedt: - Fix RCU callback of freeing an eventfs_inode. The freeing of the eventfs_inode from the kref going to zero freed the contents of the eventfs_inode and then used kfree_rcu() to free the inode itself. But the contents should also be protected by RCU. Switch to a call_rcu() that calls a function to free all of the eventfs_inode after the RCU synchronization. - The tracing subsystem maps its own descriptor to a file represented by eventfs. The freeing of this descriptor needs to know when the last reference of an eventfs_inode is released, but currently there is no interface for that. Add a "release" callback to the eventfs_inode entry array that allows for freeing of data that can be referenced by the eventfs_inode being opened. Then increment the ref counter for this descriptor when the eventfs_inode file is created, and decrement/free it when the last reference to the eventfs_inode is released and the file is removed. This prevents races between freeing the descriptor and the opening of the eventfs file. - Fix the permission processing of eventfs. The change to make the permissions of eventfs default to the mount point but keep track of when changes were made had a side effect that could cause security concerns. When the tracefs is remounted with a given gid or uid, all the files within it should inherit that gid or uid. But if the admin had changed the permission of some file within the tracefs file system, it would not get updated by the remount. This caused the kselftest of file permissions to fail the second time it is run. The first time, all changes would look fine, but the second time, because the changes were "saved", the remount did not reset them. Create a link list of all existing tracefs inodes, and clear the saved flags on them on a remount if the remount changes the corresponding gid or uid fields. This also simplifies the code by removing the distinction between the toplevel eventfs and an instance eventfs. They should both act the same. They were different because of a misconception due to the remount not resetting the flags. Now that remount resets all the files and directories to default to the root node if a uid/gid is specified, it makes the logic simpler to implement. * tag 'trace-v6.9-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Have "events" directory get permissions from its parent eventfs: Do not treat events directory different than other directories eventfs: Do not differentiate the toplevel events directory tracefs: Still use mount point as default permissions for instances tracefs: Reset permissions on remount if permissions are options eventfs: Free all of the eventfs_inode after RCU eventfs/tracing: Add callback for release of an eventfs_inode
2024-05-05Merge tag 'dma-mapping-6.9-2024-05-04' of ↵Linus Torvalds
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fix from Christoph Hellwig: - fix the combination of restricted pools and dynamic swiotlb (Will Deacon) * tag 'dma-mapping-6.9-2024-05-04' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: initialise restricted pool list_head when SWIOTLB_DYNAMIC=y
2024-05-05Merge tag 'clk-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk fixes from Stephen Boyd: "A handful of clk driver fixes: - Avoid a deadlock in the Qualcomm clk driver by making the regulator which supplies the GDSC optional - Restore RPM clks on Qualcomm msm8976 by setting num_clks - Fix Allwinner H6 CPU rate changing logic to avoid system crashes by temporarily reparenting the CPU clk to something that isn't being changed - Set a MIPI PLL min/max rate on Allwinner A64 to fix blank screens on some devices - Revert back to of_match_device() in the Samsung clkout driver to get the match data based on the parent device's compatible string" * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: samsung: Revert "clk: Use device_get_match_data()" clk: sunxi-ng: a64: Set minimum and maximum rate for PLL-MIPI clk: sunxi-ng: common: Support minimum and maximum rate clk: sunxi-ng: h6: Reparent CPUX during PLL CPUX rate change clk: qcom: smd-rpm: Restore msm8976 num_clk clk: qcom: gdsc: treat optional supplies as optional
2024-05-04eventfs: Have "events" directory get permissions from its parentSteven Rostedt (Google)
The events directory gets its permissions from the root inode. But this can cause an inconsistency if the instances directory changes its permissions, as the permissions of the created directories under it should inherit the permissions of the instances directory when directories under it are created. Currently the behavior is: # cd /sys/kernel/tracing # chgrp 1002 instances # mkdir instances/foo # ls -l instances/foo [..] -r--r----- 1 root lkp 0 May 1 18:55 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 18:55 current_tracer -rw-r----- 1 root lkp 0 May 1 18:55 error_log drwxr-xr-x 1 root root 0 May 1 18:55 events --w------- 1 root lkp 0 May 1 18:55 free_buffer drwxr-x--- 2 root lkp 0 May 1 18:55 options drwxr-x--- 10 root lkp 0 May 1 18:55 per_cpu -rw-r----- 1 root lkp 0 May 1 18:55 set_event All the files and directories under "foo" has the "lkp" group except the "events" directory. That's because its getting its default value from the mount point instead of its parent. Have the "events" directory make its default value based on its parent's permissions. That now gives: # ls -l instances/foo [..] -rw-r----- 1 root lkp 0 May 1 21:16 buffer_subbuf_size_kb -r--r----- 1 root lkp 0 May 1 21:16 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 21:16 current_tracer -rw-r----- 1 root lkp 0 May 1 21:16 error_log drwxr-xr-x 1 root lkp 0 May 1 21:16 events --w------- 1 root lkp 0 May 1 21:16 free_buffer drwxr-x--- 2 root lkp 0 May 1 21:16 options drwxr-x--- 10 root lkp 0 May 1 21:16 per_cpu -rw-r----- 1 root lkp 0 May 1 21:16 set_event Link: https://lore.kernel.org/linux-trace-kernel/20240502200906.161887248@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04eventfs: Do not treat events directory different than other directoriesSteven Rostedt (Google)
Treat the events directory the same as other directories when it comes to permissions. The events directory was considered different because it's dentry is persistent, whereas the other directory dentries are created when accessed. But the way tracefs now does its ownership by using the root dentry's permissions as the default permissions, the events directory can get out of sync when a remount is performed setting the group and user permissions. Remove the special case for the events directory on setting the attributes. This allows the updates caused by remount to work properly as well as simplifies the code. Link: https://lore.kernel.org/linux-trace-kernel/20240502200906.002923579@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04eventfs: Do not differentiate the toplevel events directorySteven Rostedt (Google)
The toplevel events directory is really no different than the events directory of instances. Having the two be different caused inconsistencies and made it harder to fix the permissions bugs. Make all events directories act the same. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.846448710@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04tracefs: Still use mount point as default permissions for instancesSteven Rostedt (Google)
If the instances directory's permissions were never change, then have it and its children use the mount point permissions as the default. Currently, the permissions of instance directories are determined by the instance directory's permissions itself. But if the tracefs file system is remounted and changes the permissions, the instance directory and its children should use the new permission. But because both the instance directory and its children use the instance directory's inode for permissions, it misses the update. To demonstrate this: # cd /sys/kernel/tracing/ # mkdir instances/foo # ls -ld instances/foo drwxr-x--- 5 root root 0 May 1 19:07 instances/foo # ls -ld instances drwxr-x--- 3 root root 0 May 1 18:57 instances # ls -ld current_tracer -rw-r----- 1 root root 0 May 1 18:57 current_tracer # mount -o remount,gid=1002 . # ls -ld instances drwxr-x--- 3 root root 0 May 1 18:57 instances # ls -ld instances/foo/ drwxr-x--- 5 root root 0 May 1 19:07 instances/foo/ # ls -ld current_tracer -rw-r----- 1 root lkp 0 May 1 18:57 current_tracer Notice that changing the group id to that of "lkp" did not affect the instances directory nor its children. It should have been: # ls -ld current_tracer -rw-r----- 1 root root 0 May 1 19:19 current_tracer # ls -ld instances/foo/ drwxr-x--- 5 root root 0 May 1 19:25 instances/foo/ # ls -ld instances drwxr-x--- 3 root root 0 May 1 19:19 instances # mount -o remount,gid=1002 . # ls -ld current_tracer -rw-r----- 1 root lkp 0 May 1 19:19 current_tracer # ls -ld instances drwxr-x--- 3 root lkp 0 May 1 19:19 instances # ls -ld instances/foo/ drwxr-x--- 5 root lkp 0 May 1 19:25 instances/foo/ Where all files were updated by the remount gid update. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.686838327@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04tracefs: Reset permissions on remount if permissions are optionsSteven Rostedt (Google)
There's an inconsistency with the way permissions are handled in tracefs. Because the permissions are generated when accessed, they default to the root inode's permission if they were never set by the user. If the user sets the permissions, then a flag is set and the permissions are saved via the inode (for tracefs files) or an internal attribute field (for eventfs). But if a remount happens that specify the permissions, all the files that were not changed by the user gets updated, but the ones that were are not. If the user were to remount the file system with a given permission, then all files and directories within that file system should be updated. This can cause security issues if a file's permission was updated but the admin forgot about it. They could incorrectly think that remounting with permissions set would update all files, but miss some. For example: # cd /sys/kernel/tracing # chgrp 1002 current_tracer # ls -l [..] -rw-r----- 1 root root 0 May 1 21:25 buffer_size_kb -rw-r----- 1 root root 0 May 1 21:25 buffer_subbuf_size_kb -r--r----- 1 root root 0 May 1 21:25 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 21:25 current_tracer -rw-r----- 1 root root 0 May 1 21:25 dynamic_events -r--r----- 1 root root 0 May 1 21:25 dyn_ftrace_total_info -r--r----- 1 root root 0 May 1 21:25 enabled_functions Where current_tracer now has group "lkp". # mount -o remount,gid=1001 . # ls -l -rw-r----- 1 root tracing 0 May 1 21:25 buffer_size_kb -rw-r----- 1 root tracing 0 May 1 21:25 buffer_subbuf_size_kb -r--r----- 1 root tracing 0 May 1 21:25 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 21:25 current_tracer -rw-r----- 1 root tracing 0 May 1 21:25 dynamic_events -r--r----- 1 root tracing 0 May 1 21:25 dyn_ftrace_total_info -r--r----- 1 root tracing 0 May 1 21:25 enabled_functions Everything changed but the "current_tracer". Add a new link list that keeps track of all the tracefs_inodes which has the permission flags that tell if the file/dir should use the root inode's permission or not. Then on remount, clear all the flags so that the default behavior of using the root inode's permission is done for all files and directories. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.529542160@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04eventfs: Free all of the eventfs_inode after RCUSteven Rostedt (Google)
The freeing of eventfs_inode via a kfree_rcu() callback. But the content of the eventfs_inode was being freed after the last kref. This is dangerous, as changes are being made that can access the content of an eventfs_inode from an RCU loop. Instead of using kfree_rcu() use call_rcu() that calls a function to do all the freeing of the eventfs_inode after a RCU grace period has expired. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.370261163@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 43aa6f97c2d03 ("eventfs: Get rid of dentry pointers without refcounts") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04eventfs/tracing: Add callback for release of an eventfs_inodeSteven Rostedt (Google)
Synthetic events create and destroy tracefs files when they are created and removed. The tracing subsystem has its own file descriptor representing the state of the events attached to the tracefs files. There's a race between the eventfs files and this file descriptor of the tracing system where the following can cause an issue: With two scripts 'A' and 'B' doing: Script 'A': echo "hello int aaa" > /sys/kernel/tracing/synthetic_events while : do echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable done Script 'B': echo > /sys/kernel/tracing/synthetic_events Script 'A' creates a synthetic event "hello" and then just writes zero into its enable file. Script 'B' removes all synthetic events (including the newly created "hello" event). What happens is that the opening of the "enable" file has: { struct trace_event_file *file = inode->i_private; int ret; ret = tracing_check_open_get_tr(file->tr); [..] But deleting the events frees the "file" descriptor, and a "use after free" happens with the dereference at "file->tr". The file descriptor does have a reference counter, but there needs to be a way to decrement it from the eventfs when the eventfs_inode is removed that represents this file descriptor. Add an optional "release" callback to the eventfs_entry array structure, that gets called when the eventfs file is about to be removed. This allows for the creating on the eventfs file to increment the tracing file descriptor ref counter. When the eventfs file is deleted, it can call the release function that will call the put function for the tracing file descriptor. This will protect the tracing file from being freed while a eventfs file that references it is being opened. Link: https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-Tze-nan.Wu@mediatek.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240502090315.448cba46@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode") Reported-by: Tze-nan wu <Tze-nan.Wu@mediatek.com> Tested-by: Tze-nan Wu (吳澤南) <Tze-nan.Wu@mediatek.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-03Merge tag 'cxl-fixes-6.9-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull cxl fix from Dave Jiang: "Add missing RCH support for endpoint access_coordinate calculation. A late bug was reported by Robert Richter that the Restricted CXL Host (RCH) support was missing in the CXL endpoint access_coordinate calculation. The missing support causes the topology iterator to stumble over a NULL pointer and triggers a kernel OOPS on a platform with CXL 1.1 support. The fix bypasses RCH topology as the access_coordinate calculation is not necessary since RCH does not support hotplug and the memory region exported should be covered by the HMAT table already. A unit test is also added to cxl_test to check against future regressions on the topology iterator" * tag 'cxl-fixes-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: cxl: Fix cxl_endpoint_get_perf_coordinate() support for RCH