summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2022-12-23KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reachedSean Christopherson
Map the leaf SPTE when handling a TDP MMU page fault if and only if the target level is reached. A recent commit reworked the retry logic and incorrectly assumed that walking SPTEs would never "fail", as the loop either bails (retries) or installs parent SPs. However, the iterator itself will bail early if it detects a frozen (REMOVED) SPTE when stepping down. The TDP iterator also rereads the current SPTE before stepping down specifically to avoid walking into a part of the tree that is being removed, which means it's possible to terminate the loop without the guts of the loop observing the frozen SPTE, e.g. if a different task zaps a parent SPTE between the initial read and try_step_down()'s refresh. Mapping a leaf SPTE at the wrong level results in all kinds of badness as page table walkers interpret the SPTE as a page table, not a leaf, and walk into the weeds. ------------[ cut here ]------------ WARNING: CPU: 1 PID: 1025 at arch/x86/kvm/mmu/tdp_mmu.c:1070 kvm_tdp_mmu_map+0x481/0x510 Modules linked in: kvm_intel CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_tdp_mmu_map+0x481/0x510 RSP: 0018:ffffc9000072fba8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffffc9000072fcc0 RCX: 0000000000000027 RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8 RBP: ffff888107d45a10 R08: ffff888277c5b4c0 R09: ffffc9000072fa48 R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000073a0e0 R13: ffff88810fc54800 R14: ffff888107d1ae60 R15: ffff88810fc54f90 FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0 Call Trace: <TASK> kvm_tdp_page_fault+0x10c/0x130 kvm_mmu_page_fault+0x103/0x680 vmx_handle_exit+0x132/0x5a0 [kvm_intel] vcpu_enter_guest+0x60c/0x16f0 kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0 kvm_vcpu_ioctl+0x271/0x660 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> ---[ end trace 0000000000000000 ]--- Invalid SPTE change: cannot replace a present leaf SPTE with another present leaf SPTE mapping a different PFN! as_id: 0 gfn: 100200 old_spte: 600000112400bf3 new_spte: 6000001126009f3 level: 2 ------------[ cut here ]------------ kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559! invalid opcode: 0000 [#1] SMP CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:__handle_changed_spte.cold+0x95/0x9c RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246 RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027 RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8 RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0 R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3 R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01 FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0 Call Trace: <TASK> kvm_tdp_mmu_map+0x3b0/0x510 kvm_tdp_page_fault+0x10c/0x130 kvm_mmu_page_fault+0x103/0x680 vmx_handle_exit+0x132/0x5a0 [kvm_intel] vcpu_enter_guest+0x60c/0x16f0 kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0 kvm_vcpu_ioctl+0x271/0x660 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> Modules linked in: kvm_intel ---[ end trace 0000000000000000 ]--- Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry") Cc: Robert Hoo <robert.hu@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221213033030.83345-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is frozenSean Christopherson
Hoist the is_removed_spte() check above the "level == goal_level" check when walking SPTEs during a TDP MMU page fault to avoid attempting to map a leaf entry if said entry is frozen by a different task/vCPU. ------------[ cut here ]------------ WARNING: CPU: 3 PID: 939 at arch/x86/kvm/mmu/tdp_mmu.c:653 kvm_tdp_mmu_map+0x269/0x4b0 Modules linked in: kvm_intel CPU: 3 PID: 939 Comm: nx_huge_pages_t Not tainted 6.1.0-rc4+ #67 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_tdp_mmu_map+0x269/0x4b0 RSP: 0018:ffffc9000068fba8 EFLAGS: 00010246 RAX: 00000000000005a0 RBX: ffffc9000068fcc0 RCX: 0000000000000005 RDX: ffff88810741f000 RSI: ffff888107f04600 RDI: ffffc900006a3000 RBP: 060000010b000bf3 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 000ffffffffff000 R12: 0000000000000005 R13: ffff888113670000 R14: ffff888107464958 R15: 0000000000000000 FS: 00007f01c942c740(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000117013006 CR4: 0000000000172ea0 Call Trace: <TASK> kvm_tdp_page_fault+0x10c/0x130 kvm_mmu_page_fault+0x103/0x680 vmx_handle_exit+0x132/0x5a0 [kvm_intel] vcpu_enter_guest+0x60c/0x16f0 kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0 kvm_vcpu_ioctl+0x271/0x660 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> ---[ end trace 0000000000000000 ]--- Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry") Cc: Robert Hoo <robert.hu@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Robert Hoo <robert.hu@linux.intel.com> Message-Id: <20221213033030.83345-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: nVMX: Don't stuff secondary execution control if it's not supportedSean Christopherson
When stuffing the allowed secondary execution controls for nested VMX in response to CPUID updates, don't set the allowed-1 bit for a feature that isn't supported by KVM, i.e. isn't allowed by the canonical vmcs_config. WARN if KVM attempts to manipulate a feature that isn't supported. All features that are currently stuffed are always advertised to L1 for nested VMX if they are supported in KVM's base configuration, and no additional features should ever be added to the CPUID-induced stuffing (updating VMX MSRs in response to CPUID updates is a long-standing KVM flaw that is slowly being fixed). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221213062306.667649-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: nVMX: Properly expose ENABLE_USR_WAIT_PAUSE control to L1Sean Christopherson
Set ENABLE_USR_WAIT_PAUSE in KVM's supported VMX MSR configuration if the feature is supported in hardware and enabled in KVM's base, non-nested configuration, i.e. expose ENABLE_USR_WAIT_PAUSE to L1 if it's supported. This fixes a bug where saving/restoring, i.e. migrating, a vCPU will fail if WAITPKG (the associated CPUID feature) is enabled for the vCPU, and obviously allows L1 to enable the feature for L2. KVM already effectively exposes ENABLE_USR_WAIT_PAUSE to L1 by stuffing the allowed-1 control ina vCPU's virtual MSR_IA32_VMX_PROCBASED_CTLS2 when updating secondary controls in response to KVM_SET_CPUID(2), but (a) that depends on flawed code (KVM shouldn't touch VMX MSRs in response to CPUID updates) and (b) runs afoul of vmx_restore_control_msr()'s restriction that the guest value must be a strict subset of the supported host value. Although no past commit explicitly enabled nested support for WAITPKG, doing so is safe and functionally correct from an architectural perspective as no additional KVM support is needed to virtualize TPAUSE, UMONITOR, and UMWAIT for L2 relative to L1, and KVM already forwards VM-Exits to L1 as necessary (commit bf653b78f960, "KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit"). Note, KVM always keeps the hosts MSR_IA32_UMWAIT_CONTROL resident in hardware, i.e. always runs both L1 and L2 with the host's power management settings for TPAUSE and UMWAIT. See commit bf09fb6cba4f ("KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL") for more details. Fixes: e69e72faa3a0 ("KVM: x86: Add support for user wait instructions") Cc: stable@vger.kernel.org Reported-by: Aaron Lewis <aaronlewis@google.com> Reported-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20221213062306.667649-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: nVMX: Document that ignoring memory failures for VMCLEAR is deliberateSean Christopherson
Explicitly drop the result of kvm_vcpu_write_guest() when writing the "launch state" as part of VMCLEAR emulation, and add a comment to call out that KVM's behavior is architecturally valid. Intel's pseudocode effectively says that VMCLEAR is a nop if the target VMCS address isn't in memory, e.g. if the address points at MMIO. Add a FIXME to call out that suppressing failures on __copy_to_user() is wrong, as memory (a memslot) does exist in that case. Punt the issue to the future as open coding kvm_vcpu_write_guest() just to make sure the guest dies with -EFAULT isn't worth the extra complexity. The flaw will need to be addressed if KVM ever does something intelligent on uaccess failures, e.g. to support post-copy demand paging, but in that case KVM will need a more thorough overhaul, i.e. VMCLEAR shouldn't need to open code a core KVM helper. No functional change intended. Reported-by: coverity-bot <keescook+coverity-bot@chromium.org> Addresses-Coverity-ID: 1527765 ("Error handling issues") Fixes: 587d7e72aedc ("kvm: nVMX: VMCLEAR should not cause the vCPU to shut down") Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221220154224.526568-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86: Sanity check inputs to kvm_handle_memory_failure()Sean Christopherson
Add a sanity check in kvm_handle_memory_failure() to assert that a valid x86_exception structure is provided if the memory "failure" wants to propagate a fault into the guest. If a memory failure happens during a direct guest physical memory access, e.g. for nested VMX, KVM hardcodes the failure to X86EMUL_IO_NEEDED and doesn't provide an exception pointer (because the exception struct would just be filled with garbage). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221220153427.514032-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86: Simplify kvm_apic_hw_enabledPeng Hao
kvm_apic_hw_enabled() only needs to return bool, there is no place to use the return value of MSR_IA32_APICBASE_ENABLE. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Message-Id: <CAPm50aJ=BLXNWT11+j36Dd6d7nz2JmOBk4u7o_NPQ0N61ODu1g@mail.gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86: hyper-v: Fix 'using uninitialized value' Coverity warningVitaly Kuznetsov
In kvm_hv_flush_tlb(), 'data_offset' and 'consumed_xmm_halves' variables are used in a mutually exclusive way: in 'hc->fast' we count in 'XMM halves' and increase 'data_offset' otherwise. Coverity discovered, that in one case both variables are incremented unconditionally. This doesn't seem to cause any issues as the only user of 'data_offset'/'consumed_xmm_halves' data is kvm_hv_get_tlb_flush_entries() -> kvm_hv_get_hc_data() which also takes into account 'hc->fast' but is still worth fixing. To make things explicit, put 'data_offset' and 'consumed_xmm_halves' to 'struct kvm_hv_hcall' as a union and use at call sites. This allows to remove explicit 'data_offset'/'consumed_xmm_halves' parameters from kvm_hv_get_hc_data()/kvm_get_sparse_vp_set()/kvm_hv_get_tlb_flush_entries() helpers. Note: 'struct kvm_hv_hcall' is allocated on stack in kvm_hv_hypercall() and is not zeroed, consumers are supposed to initialize the appropriate field if needed. Reported-by: coverity-bot <keescook+coverity-bot@chromium.org> Addresses-Coverity-ID: 1527764 ("Uninitialized variables") Fixes: 260970862c88 ("KVM: x86: hyper-v: Handle HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST{,EX} calls gently") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221208102700.959630-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86: ioapic: Fix level-triggered EOI and userspace I/OAPIC reconfigure raceAdamos Ttofari
When scanning userspace I/OAPIC entries, intercept EOI for level-triggered IRQs if the current vCPU has a pending and/or in-service IRQ for the vector in its local API, even if the vCPU doesn't match the new entry's destination. This fixes a race between userspace I/OAPIC reconfiguration and IRQ delivery that results in the vector's bit being left set in the remote IRR due to the eventual EOI not being forwarded to the userspace I/OAPIC. Commit 0fc5a36dd6b3 ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race") fixed the in-kernel IOAPIC, but not the userspace IOAPIC configuration, which has a similar race. Fixes: 0fc5a36dd6b3 ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race") Signed-off-by: Adamos Ttofari <attofari@amazon.de> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221208094415.12723-1-attofari@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86/pmu: Prevent zero period event from being repeatedly releasedLike Xu
The current vPMU can reuse the same pmc->perf_event for the same hardware event via pmc_pause/resume_counter(), but this optimization does not apply to a portion of the TSX events (e.g., "event=0x3c,in_tx=1, in_tx_cp=1"), where event->attr.sample_period is legally zero at creation, thus making the perf call to perf_event_period() meaningless (no need to adjust sample period in this case), and instead causing such reusable perf_events to be repeatedly released and created. Avoid releasing zero sample_period events by checking is_sampling_event() to follow the previously enable/disable optimization. Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20221207071506.15733-2-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-22bpf, x86: Improve PROBE_MEM runtime load checkDave Marchevsky
This patch rewrites the runtime PROBE_MEM check insns emitted by the BPF JIT in order to ensure load safety. The changes in the patch fix two issues with the previous logic and more generally improve size of emitted code. Paragraphs between this one and "FIX 1" below explain the purpose of the runtime check and examine the current implementation. When a load is marked PROBE_MEM - e.g. due to PTR_UNTRUSTED access - the address being loaded from is not necessarily valid. The BPF jit sets up exception handlers for each such load which catch page faults and 0 out the destination register. Arbitrary register-relative loads can escape this exception handling mechanism. Specifically, a load like dst_reg = *(src_reg + off) will not trigger BPF exception handling if (src_reg + off) is outside of kernel address space, resulting in an uncaught page fault. A concrete example of such behavior is a program like: struct result { char space[40]; long a; }; /* if err, returns ERR_PTR(-EINVAL) */ struct result *ptr = get_ptr_maybe_err(); long x = ptr->a; If get_ptr_maybe_err returns ERR_PTR(-EINVAL) and the result isn't checked for err, 'result' will be (u64)-EINVAL, a number close to U64_MAX. The ptr->a load will be > U64_MAX and will wrap over to a small positive u64, which will be in userspace and thus not covered by BPF exception handling mechanism. In order to prevent such loads from occurring, the BPF jit emits some instructions which do runtime checking of (src_reg + off) and skip the actual load if it's out of range. As an example, here are instructions emitted for a %rdi = *(%rdi + 0x10) PROBE_MEM load: 72: movabs $0x800000000010,%r11 --| 7c: cmp %r11,%rdi |- 72 - 7f: Check 1 7f: jb 0x000000000000008d --| 81: mov %rdi,%r11 -----| 84: add $0x0000000000000010,%r11 |- 81-8b: Check 2 8b: jnc 0x0000000000000091 -----| 8d: xor %edi,%edi ---- 0 out dest 8f: jmp 0x0000000000000095 91: mov 0x10(%rdi),%rdi ---- Actual load 95: The JIT considers kernel address space to start at MAX_TASK_SIZE + PAGE_SIZE. Determining whether a load will be outside of kernel address space should be a simple check: (src_reg + off) >= MAX_TASK_SIZE + PAGE_SIZE But because there is only one spare register when the checking logic is emitted, this logic is split into two checks: Check 1: src_reg >= (MAX_TASK_SIZE + PAGE_SIZE - off) Check 2: src_reg + off doesn't wrap over U64_MAX and result in small pos u64 Emitted insns implementing Checks 1 and 2 are annotated in the above example. Check 1 can be done with a single spare register since the source reg by definition is the left-hand-side of the inequality. Since adding 'off' to both sides of Check 1's inequality results in the original inequality we want, it's equivalent to testing that inequality. Except in the case where src_reg + off wraps past U64_MAX, which is why Check 2 needs to actually add src_reg + off if Check 1 passes - again using the single spare reg. FIX 1: The Check 1 inequality listed above is not what current code is doing. Current code is a bit more pessimistic, instead checking: src_reg >= (MAX_TASK_SIZE + PAGE_SIZE + abs(off)) The 0x800000000010 in above example is from this current check. If Check 1 was corrected to use the correct right-hand-side, the value would be 0x7ffffffffff0. This patch changes the checking logic more broadly (FIX 2 below will elaborate), fixing this issue as a side-effect of the rewrite. Regardless, it's important to understand why Check 1 should've been doing MAX_TASK_SIZE + PAGE_SIZE - off before proceeding. FIX 2: Current code relies on a 'jnc' to determine whether src_reg + off addition wrapped over. For negative offsets this logic is incorrect. Consider Check 2 insns emitted when off = -0x10: 81: mov %rdi,%r11 84: add 0xfffffffffffffff0,%r11 8b: jnc 0x0000000000000091 2's complement representation of -0x10 is a large positive u64. Any value of src_reg that passes Check 1 will result in carry flag being set after (src_reg + off) addition. So a load with any negative offset will always fail Check 2 at runtime and never do the actual load. This patch fixes the negative offset issue by rewriting both checks in order to not rely on carry flag. The rewrite takes advantage of the fact that, while we only have one scratch reg to hold arbitrary values, we know the offset at JIT time. This we can use src_reg as a temporary scratch reg to hold src_reg + offset since we can return it to its original value by later subtracting offset. As a result we can directly check the original inequality we care about: (src_reg + off) >= MAX_TASK_SIZE + PAGE_SIZE For a load like %rdi = *(%rsi + -0x10), this results in emitted code: 43: movabs $0x800000000000,%r11 4d: add $0xfffffffffffffff0,%rsi --- src_reg += off 54: cmp %r11,%rsi --- Check original inequality 57: jae 0x000000000000005d 59: xor %edi,%edi 5b: jmp 0x0000000000000061 5d: mov 0x0(%rdi),%rsi --- Actual Load 61: sub $0xfffffffffffffff0,%rsi --- src_reg -= off Note that the actual load is always done with offset 0, since previous insns have already done src_reg += off. Regardless of whether the new check succeeds or fails, insn 61 is always executed, returning src_reg to its original value. Because the goal of these checks is to ensure that loaded-from address will be protected by BPF exception handler, the new check can safely ignore any wrapover from insn 4d. If such wrapped-over address passes insn 54 + 57's cmp-and-jmp it will have such protection so the load can proceed. IMPROVEMENTS: The above improved logic is 8 insns vs original logic's 9, and has 1 fewer jmp. The number of checking insns can be further improved in common scenarios: If src_reg == dst_reg, the actual load insn will clobber src_reg, so there's no original src_reg state for the sub insn immediately following the load to restore, so it can be omitted. In fact, it must be omitted since it would incorrectly subtract from the result of the load if it wasn't. So for src_reg == dst_reg, JIT emits these insns: 3c: movabs $0x800000000000,%r11 46: add $0xfffffffffffffff0,%rdi 4d: cmp %r11,%rdi 50: jae 0x0000000000000056 52: xor %edi,%edi 54: jmp 0x000000000000005a 56: mov 0x0(%rdi),%rdi 5a: The only difference from larger example being the omitted sub, which would've been insn 5a in this example. If offset == 0, we can similarly omit the sub as in previous case, since there's nothing added to subtract. For the same reason we can omit the addition as well, resulting in JIT emitting these insns: 46: movabs $0x800000000000,%r11 4d: cmp %r11,%rdi 50: jae 0x0000000000000056 52: xor %edi,%edi 54: jmp 0x000000000000005a 56: mov 0x0(%rdi),%rdi 5a: Although the above example also has src_reg == dst_reg, the same offset == 0 optimization is valid to apply if src_reg != dst_reg. To summarize the improvements in emitted insn count for the check-and-load: BEFORE: 8 check insns, 3 jmps AFTER (general case): 7 check insns, 2 jmps (12.5% fewer insn, 33% jmp) AFTER (src == dst): 6 check insns, 2 jmps (25% fewer insn) AFTER (offset == 0): 5 check insns, 2 jmps (37.5% fewer insn) (Above counts don't include the 1 load insn, just checking around it) Based on BPF bytecode + JITted x86 insn I saw while experimenting with these improvements, I expect the src_reg == dst_reg case to occur most often, followed by offset == 0, then the general case. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221216214319.3408356-1-davemarchevsky@fb.com
2022-12-20prandom: remove prandom_u32_max()Jason A. Donenfeld
Convert the final two users of prandom_u32_max() that slipped in during 6.2-rc1 to use get_random_u32_below(). Then, with no more users left, we can finally remove the deprecated function. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-12-19Merge tag 'kbuild-v6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Support zstd-compressed debug info - Allow W=1 builds to detect objects shared among multiple modules - Add srcrpm-pkg target to generate a source RPM package - Make the -s option detection work for future GNU Make versions - Add -Werror to KBUILD_CPPFLAGS when CONFIG_WERROR=y - Allow W=1 builds to detect -Wundef warnings in any preprocessed files - Raise the minimum supported version of binutils to 2.25 - Use $(intcmp ...) to compare integers if GNU Make >= 4.4 is used - Use $(file ...) to read a file if GNU Make >= 4.2 is used - Print error if GNU Make older than 3.82 is used - Allow modpost to detect section mismatches with Clang LTO - Include vmlinuz.efi into kernel tarballs for arm64 CONFIG_EFI_ZBOOT=y * tag 'kbuild-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (29 commits) buildtar: fix tarballs with EFI_ZBOOT enabled modpost: Include '.text.*' in TEXT_SECTIONS padata: Mark padata_work_init() as __ref kbuild: ensure Make >= 3.82 is used kbuild: refactor the prerequisites of the modpost rule kbuild: change module.order to list *.o instead of *.ko kbuild: use .NOTINTERMEDIATE for future GNU Make versions kconfig: refactor Makefile to reduce process forks kbuild: add read-file macro kbuild: do not sort after reading modules.order kbuild: add test-{ge,gt,le,lt} macros Documentation: raise minimum supported version of binutils to 2.25 kbuild: add -Wundef to KBUILD_CPPFLAGS for W=1 builds kbuild: move -Werror from KBUILD_CFLAGS to KBUILD_CPPFLAGS kbuild: Port silent mode detection to future gnu make. init/version.c: remove #include <generated/utsrelease.h> firmware_loader: remove #include <generated/utsrelease.h> modpost: Mark uuid_le type to be suitable only for MEI kbuild: add ability to make source rpm buildable using koji kbuild: warn objects shared among multiple modules ...
2022-12-19Merge tag 'powerpc-6.2-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Add powerpc qspinlock implementation optimised for large system scalability and paravirt. See the merge message for more details - Enable objtool to be built on powerpc to generate mcount locations - Use a temporary mm for code patching with the Radix MMU, so the writable mapping is restricted to the patching CPU - Add an option to build the 64-bit big-endian kernel with the ELFv2 ABI - Sanitise user registers on interrupt entry on 64-bit Book3S - Many other small features and fixes Thanks to Aboorva Devarajan, Angel Iglesias, Benjamin Gray, Bjorn Helgaas, Bo Liu, Chen Lifu, Christoph Hellwig, Christophe JAILLET, Christophe Leroy, Christopher M. Riedl, Colin Ian King, Deming Wang, Disha Goel, Dmitry Torokhov, Finn Thain, Geert Uytterhoeven, Gustavo A. R. Silva, Haowen Bai, Joel Stanley, Jordan Niethe, Julia Lawall, Kajol Jain, Laurent Dufour, Li zeming, Miaoqian Lin, Michael Jeanson, Nathan Lynch, Naveen N. Rao, Nayna Jain, Nicholas Miehlbradt, Nicholas Piggin, Pali Rohár, Randy Dunlap, Rohan McLure, Russell Currey, Sathvika Vasireddy, Shaomin Deng, Stephen Kitt, Stephen Rothwell, Thomas Weißschuh, Tiezhu Yang, Uwe Kleine-König, Xie Shaowen, Xiu Jianfeng, XueBing Chen, Yang Yingliang, Zhang Jiaming, ruanjinjie, Jessica Yu, and Wolfram Sang. * tag 'powerpc-6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (181 commits) powerpc/code-patching: Fix oops with DEBUG_VM enabled powerpc/qspinlock: Fix 32-bit build powerpc/prom: Fix 32-bit build powerpc/rtas: mandate RTAS syscall filtering powerpc/rtas: define pr_fmt and convert printk call sites powerpc/rtas: clean up includes powerpc/rtas: clean up rtas_error_log_max initialization powerpc/pseries/eeh: use correct API for error log size powerpc/rtas: avoid scheduling in rtas_os_term() powerpc/rtas: avoid device tree lookups in rtas_os_term() powerpc/rtasd: use correct OF API for event scan rate powerpc/rtas: document rtas_call() powerpc/pseries: unregister VPA when hot unplugging a CPU powerpc/pseries: reset the RCU watchdogs after a LPM powerpc: Take in account addition CPU node when building kexec FDT powerpc: export the CPU node count powerpc/cpuidle: Set CPUIDLE_FLAG_POLLING for snooze state powerpc/dts/fsl: Fix pca954x i2c-mux node names cxl: Remove unnecessary cxl_pci_window_alignment() selftests/powerpc: Fix resource leaks ...
2022-12-17Merge tag 'x86_mm_for_6.2_v2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Dave Hansen: "New Feature: - Randomize the per-cpu entry areas Cleanups: - Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it - Move to "native" set_memory_rox() helper - Clean up pmd_get_atomic() and i386-PAE - Remove some unused page table size macros" * tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits) x86/mm: Ensure forced page table splitting x86/kasan: Populate shadow for shared chunk of the CPU entry area x86/kasan: Add helpers to align shadow addresses up and down x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area x86/mm: Recompute physical address for every page of per-CPU CEA mapping x86/mm: Rename __change_page_attr_set_clr(.checkalias) x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias() x86/mm: Untangle __change_page_attr_set_clr(.checkalias) x86/mm: Add a few comments x86/mm: Fix CR3_ADDR_MASK x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros mm: Convert __HAVE_ARCH_P..P_GET to the new style mm: Remove pointless barrier() after pmdp_get_lockless() x86/mm/pae: Get rid of set_64bit() x86_64: Remove pointless set_64bit() usage x86/mm/pae: Be consistent with pXXp_get_and_clear() x86/mm/pae: Use WRITE_ONCE() x86/mm/pae: Don't (ab)use atomic64 mm/gup: Fix the lockless PMD access ...
2022-12-16Merge tag 'driver-core-6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the set of driver core and kernfs changes for 6.2-rc1. The "big" change in here is the addition of a new macro, container_of_const() that will preserve the "const-ness" of a pointer passed into it. The "problem" of the current container_of() macro is that if you pass in a "const *", out of it can comes a non-const pointer unless you specifically ask for it. For many usages, we want to preserve the "const" attribute by using the same call. For a specific example, this series changes the kobj_to_dev() macro to use it, allowing it to be used no matter what the const value is. This prevents every subsystem from having to declare 2 different individual macros (i.e. kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce the const value at build time, which having 2 macros would not do either. The driver for all of this have been discussions with the Rust kernel developers as to how to properly mark driver core, and kobject, objects as being "non-mutable". The changes to the kobject and driver core in this pull request are the result of that, as there are lots of paths where kobjects and device pointers are not modified at all, so marking them as "const" allows the compiler to enforce this. So, a nice side affect of the Rust development effort has been already to clean up the driver core code to be more obvious about object rules. All of this has been bike-shedded in quite a lot of detail on lkml with different names and implementations resulting in the tiny version we have in here, much better than my original proposal. Lots of subsystem maintainers have acked the changes as well. Other than this change, included in here are smaller stuff like: - kernfs fixes and updates to handle lock contention better - vmlinux.lds.h fixes and updates - sysfs and debugfs documentation updates - device property updates All of these have been in the linux-next tree for quite a while with no problems" * tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (58 commits) device property: Fix documentation for fwnode_get_next_parent() firmware_loader: fix up to_fw_sysfs() to preserve const usb.h: take advantage of container_of_const() device.h: move kobj_to_dev() to use container_of_const() container_of: add container_of_const() that preserves const-ness of the pointer driver core: fix up missed drivers/s390/char/hmcdrv_dev.c class.devnode() conversion. driver core: fix up missed scsi/cxlflash class.devnode() conversion. driver core: fix up some missing class.devnode() conversions. driver core: make struct class.devnode() take a const * driver core: make struct class.dev_uevent() take a const * cacheinfo: Remove of_node_put() for fw_token device property: Add a blank line in Kconfig of tests device property: Rename goto label to be more precise device property: Move PROPERTY_ENTRY_BOOL() a bit down device property: Get rid of __PROPERTY_ENTRY_ARRAY_EL*SIZE*() kernfs: fix all kernel-doc warnings and multiple typos driver core: pass a const * into of_device_uevent() kobject: kset_uevent_ops: make name() callback take a const * kobject: kset_uevent_ops: make filter() callback take a const * kobject: make kobject_namespace take a const * ...
2022-12-15Merge tag 'trace-v6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Add options to the osnoise tracer: - 'panic_on_stop' option that panics the kernel if osnoise is greater than some user defined threshold. - 'preempt' option, to test noise while preemption is disabled - 'irq' option, to test noise when interrupts are disabled - Add .percent and .graph suffix to histograms to give different outputs - Add nohitcount to disable showing hitcount in histogram output - Add new __cpumask() to trace event fields to annotate that a unsigned long array is a cpumask to user space and should be treated as one. - Add trace_trigger kernel command line parameter to enable trace event triggers at boot up. Useful to trace stack traces, disable tracing and take snapshots. - Fix x86/kmmio mmio tracer to work with the updates to lockdep - Unify the panic and die notifiers - Add back ftrace_expect reference that is used to extract more information in the ftrace_bug() code. - Have trigger filter parsing errors show up in the tracing error log. - Updated MAINTAINERS file to add kernel tracing mailing list and patchwork info - Use IDA to keep track of event type numbers. - And minor fixes and clean ups * tag 'trace-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (44 commits) tracing: Fix cpumask() example typo tracing: Improve panic/die notifiers ftrace: Prevent RCU stall on PREEMPT_VOLUNTARY kernels tracing: Do not synchronize freeing of trigger filter on boot up tracing: Remove pointer (asterisk) and brackets from cpumask_t field tracing: Have trigger filter parsing errors show up in error_log x86/mm/kmmio: Remove redundant preempt_disable() tracing: Fix infinite loop in tracing_read_pipe on overflowed print_trace_line Documentation/osnoise: Add osnoise/options documentation tracing/osnoise: Add preempt and/or irq disabled options tracing/osnoise: Add PANIC_ON_STOP option Documentation/osnoise: Escape underscore of NO_ prefix tracing: Fix some checker warnings tracing/osnoise: Make osnoise_options static tracing: remove unnecessary trace_trigger ifdef ring-buffer: Handle resize in early boot up tracing/hist: Fix issue of losting command info in error_log tracing: Fix issue of missing one synthetic field tracing/hist: Fix out-of-bound write on 'action_data.var_ref_idx' tracing/hist: Fix wrong return value in parse_action_params() ...
2022-12-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "ARM64: - Enable the per-vcpu dirty-ring tracking mechanism, together with an option to keep the good old dirty log around for pages that are dirtied by something other than a vcpu. - Switch to the relaxed parallel fault handling, using RCU to delay page table reclaim and giving better performance under load. - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option, which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a97d: "Fix a number of issues with MTE, such as races on the tags being initialised vs the PG_mte_tagged flag as well as the lack of support for VM_SHARED when KVM is involved. Patches from Catalin Marinas and Peter Collingbourne"). - Merge the pKVM shadow vcpu state tracking that allows the hypervisor to have its own view of a vcpu, keeping that state private. - Add support for the PMUv3p5 architecture revision, bringing support for 64bit counters on systems that support it, and fix the no-quite-compliant CHAIN-ed counter support for the machines that actually exist out there. - Fix a handful of minor issues around 52bit VA/PA support (64kB pages only) as a prefix of the oncoming support for 4kB and 16kB pages. - Pick a small set of documentation and spelling fixes, because no good merge window would be complete without those. s390: - Second batch of the lazy destroy patches - First batch of KVM changes for kernel virtual != physical address support - Removal of a unused function x86: - Allow compiling out SMM support - Cleanup and documentation of SMM state save area format - Preserve interrupt shadow in SMM state save area - Respond to generic signals during slow page faults - Fixes and optimizations for the non-executable huge page errata fix. - Reprogram all performance counters on PMU filter change - Cleanups to Hyper-V emulation and tests - Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest running on top of a L1 Hyper-V hypervisor) - Advertise several new Intel features - x86 Xen-for-KVM: - Allow the Xen runstate information to cross a page boundary - Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured - Add support for 32-bit guests in SCHEDOP_poll - Notable x86 fixes and cleanups: - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0). - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few years back when eliminating unnecessary barriers when switching between vmcs01 and vmcs02. - Clean up vmread_error_trampoline() to make it more obvious that params must be passed on the stack, even for x86-64. - Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective of the current guest CPUID. - Fudge around a race with TSC refinement that results in KVM incorrectly thinking a guest needs TSC scaling when running on a CPU with a constant TSC, but no hardware-enumerated TSC frequency. - Advertise (on AMD) that the SMM_CTL MSR is not supported - Remove unnecessary exports Generic: - Support for responding to signals during page faults; introduces new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks Selftests: - Fix an inverted check in the access tracking perf test, and restore support for asserting that there aren't too many idle pages when running on bare metal. - Fix build errors that occur in certain setups (unsure exactly what is unique about the problematic setup) due to glibc overriding static_assert() to a variant that requires a custom message. - Introduce actual atomics for clear/set_bit() in selftests - Add support for pinning vCPUs in dirty_log_perf_test. - Rename the so called "perf_util" framework to "memstress". - Add a lightweight psuedo RNG for guest use, and use it to randomize the access pattern and write vs. read percentage in the memstress tests. - Add a common ucall implementation; code dedup and pre-work for running SEV (and beyond) guests in selftests. - Provide a common constructor and arch hook, which will eventually be used by x86 to automatically select the right hypercall (AMD vs. Intel). - A bunch of added/enabled/fixed selftests for ARM64, covering memslots, breakpoints, stage-2 faults and access tracking. - x86-specific selftest changes: - Clean up x86's page table management. - Clean up and enhance the "smaller maxphyaddr" test, and add a related test to cover generic emulation failure. - Clean up the nEPT support checks. - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values. - Fix an ordering issue in the AMX test introduced by recent conversions to use kvm_cpu_has(), and harden the code to guard against similar bugs in the future. Anything that tiggers caching of KVM's supported CPUID, kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if the caching occurs before the test opts in via prctl(). Documentation: - Remove deleted ioctls from documentation - Clean up the docs for the x86 MSR filter. - Various fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits) KVM: x86: Add proper ReST tables for userspace MSR exits/flags KVM: selftests: Allocate ucall pool from MEM_REGION_DATA KVM: arm64: selftests: Align VA space allocator with TTBR0 KVM: arm64: Fix benign bug with incorrect use of VA_BITS KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow KVM: x86: Advertise that the SMM_CTL MSR is not supported KVM: x86: remove unnecessary exports KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic" tools: KVM: selftests: Convert clear/set_bit() to actual atomics tools: Drop "atomic_" prefix from atomic test_and_set_bit() tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests perf tools: Use dedicated non-atomic clear/set bit helpers tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers KVM: arm64: selftests: Enable single-step without a "full" ucall() KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself KVM: Remove stale comment about KVM_REQ_UNHALT KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR KVM: Reference to kvm_userspace_memory_region in doc and comments KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl ...
2022-12-15x86/mm: Ensure forced page table splittingDave Hansen
There are a few kernel users like kfence that require 4k pages to work correctly and do not support large mappings. They use set_memory_4k() to break down those large mappings. That, in turn relies on cpa_data->force_split option to indicate to set_memory code that it should split page tables regardless of whether the need to be. But, a recent change added an optimization which would return early if a set_memory request came in that did not change permissions. It did not consult ->force_split and would mistakenly optimize away the splitting that set_memory_4k() needs. This broke kfence. Skip the same-permission optimization when ->force_split is set. Fixes: 127960a05548 ("x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()") Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Marco Elver <elver@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/CA+G9fYuFxZTxkeS35VTZMXwQvohu73W3xbZ5NtjebsVvH6hCuA@mail.gmail.com/
2022-12-15x86/kasan: Populate shadow for shared chunk of the CPU entry areaSean Christopherson
Popuplate the shadow for the shared portion of the CPU entry area, i.e. the read-only IDT mapping, during KASAN initialization. A recent change modified KASAN to map the per-CPU areas on-demand, but forgot to keep a shadow for the common area that is shared amongst all CPUs. Map the common area in KASAN init instead of letting idt_map_in_cea() do the dirty work so that it Just Works in the unlikely event more shared data is shoved into the CPU entry area. The bug manifests as a not-present #PF when software attempts to lookup an IDT entry, e.g. when KVM is handling IRQs on Intel CPUs (KVM performs direct CALL to the IRQ handler to avoid the overhead of INTn): BUG: unable to handle page fault for address: fffffbc0000001d8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 16c03a067 P4D 16c03a067 PUD 0 Oops: 0000 [#1] PREEMPT SMP KASAN CPU: 5 PID: 901 Comm: repro Tainted: G W 6.1.0-rc3+ #410 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kasan_check_range+0xdf/0x190 vmx_handle_exit_irqoff+0x152/0x290 [kvm_intel] vcpu_run+0x1d89/0x2bd0 [kvm] kvm_arch_vcpu_ioctl_run+0x3ce/0xa70 [kvm] kvm_vcpu_ioctl+0x349/0x900 [kvm] __x64_sys_ioctl+0xb8/0xf0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: 9fd429c28073 ("x86/kasan: Map shadow for percpu pages on demand") Reported-by: syzbot+8cdd16fd5a6c0565e227@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221110203504.1985010-6-seanjc@google.com
2022-12-15x86/kasan: Add helpers to align shadow addresses up and downSean Christopherson
Add helpers to dedup code for aligning shadow address up/down to page boundaries when translating an address to its shadow. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Link: https://lkml.kernel.org/r/20221110203504.1985010-5-seanjc@google.com
2022-12-15x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten namesSean Christopherson
Rename the CPU entry area variables in kasan_init() to shorten their names, a future fix will reference the beginning of the per-CPU portion of the CPU entry area, and shadow_cpu_entry_per_cpu_begin is a bit much. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Link: https://lkml.kernel.org/r/20221110203504.1985010-4-seanjc@google.com
2022-12-15x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry areaSean Christopherson
Populate a KASAN shadow for the entire possible per-CPU range of the CPU entry area instead of requiring that each individual chunk map a shadow. Mapping shadows individually is error prone, e.g. the per-CPU GDT mapping was left behind, which can lead to not-present page faults during KASAN validation if the kernel performs a software lookup into the GDT. The DS buffer is also likely affected. The motivation for mapping the per-CPU areas on-demand was to avoid mapping the entire 512GiB range that's reserved for the CPU entry area, shaving a few bytes by not creating shadows for potentially unused memory was not a goal. The bug is most easily reproduced by doing a sigreturn with a garbage CS in the sigcontext, e.g. int main(void) { struct sigcontext regs; syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul); syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul); syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul); memset(&regs, 0, sizeof(regs)); regs.cs = 0x1d0; syscall(__NR_rt_sigreturn); return 0; } to coerce the kernel into doing a GDT lookup to compute CS.base when reading the instruction bytes on the subsequent #GP to determine whether or not the #GP is something the kernel should handle, e.g. to fixup UMIP violations or to emulate CLI/STI for IOPL=3 applications. BUG: unable to handle page fault for address: fffffbc8379ace00 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 16c03a067 P4D 16c03a067 PUD 15b990067 PMD 15b98f067 PTE 0 Oops: 0000 [#1] PREEMPT SMP KASAN CPU: 3 PID: 851 Comm: r2 Not tainted 6.1.0-rc3-next-20221103+ #432 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kasan_check_range+0xdf/0x190 Call Trace: <TASK> get_desc+0xb0/0x1d0 insn_get_seg_base+0x104/0x270 insn_fetch_from_user+0x66/0x80 fixup_umip_exception+0xb1/0x530 exc_general_protection+0x181/0x210 asm_exc_general_protection+0x22/0x30 RIP: 0003:0x0 Code: Unable to access opcode bytes at 0xffffffffffffffd6. RSP: 0003:0000000000000000 EFLAGS: 00000202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000000001d0 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Fixes: 9fd429c28073 ("x86/kasan: Map shadow for percpu pages on demand") Reported-by: syzbot+ffb4f000dc2872c93f62@syzkaller.appspotmail.com Suggested-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Link: https://lkml.kernel.org/r/20221110203504.1985010-3-seanjc@google.com
2022-12-15x86/mm: Recompute physical address for every page of per-CPU CEA mappingSean Christopherson
Recompute the physical address for each per-CPU page in the CPU entry area, a recent commit inadvertantly modified cea_map_percpu_pages() such that every PTE is mapped to the physical address of the first page. Fixes: 9fd429c28073 ("x86/kasan: Map shadow for percpu pages on demand") Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Link: https://lkml.kernel.org/r/20221110203504.1985010-2-seanjc@google.com
2022-12-15x86/mm: Rename __change_page_attr_set_clr(.checkalias)Peter Zijlstra
Now that the checkalias functionality is taken by CPA_NO_CHECK_ALIAS rename the argument to better match is remaining purpose: primary, matching __change_page_attr(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221110125544.661001508%40infradead.org
2022-12-15x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()Peter Zijlstra
There is a cludge in change_page_attr_set_clr() that inhibits propagating NX changes to the aliases (directmap and highmap) -- this is a cludge twofold: - it also inhibits the primary checks in __change_page_attr(); - it hard depends on single bit changes. The introduction of set_memory_rox() triggered this last issue for clearing both _PAGE_RW and _PAGE_NX. Explicitly ignore _PAGE_NX in cpa_process_alias() instead. Fixes: b38994948567 ("x86/mm: Implement native set_memory_rox()") Reported-by: kernel test robot <oliver.sang@intel.com> Debugged-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221110125544.594991716%40infradead.org
2022-12-15x86/mm: Untangle __change_page_attr_set_clr(.checkalias)Peter Zijlstra
The .checkalias argument to __change_page_attr_set_clr() is overloaded and serves two different purposes: - it inhibits the call to cpa_process_alias() -- as suggested by the name; however, - it also serves as 'primary' indicator for __change_page_attr() ( which in turn also serves as a recursion terminator for cpa_process_alias() ). Untangle these by extending the use of CPA_NO_CHECK_ALIAS to all callsites that currently use .checkalias=0 for this purpose. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221110125544.527267183%40infradead.org
2022-12-15x86/mm: Add a few commentsPeter Zijlstra
It's a shame to hide useful comments in Changelogs, add some to the code. Shamelessly stolen from commit: c40a56a7818c ("x86/mm/init: Remove freed kernel image areas from alias mapping") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221110125544.460677011%40infradead.org
2022-12-15x86/mm: Fix CR3_ADDR_MASKKirill A. Shutemov
The mask must not include bits above physical address mask. These bits are reserved and can be used for other things. Bits 61 and 62 are used for Linear Address Masking. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Alexander Potapenko <glider@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20221109165140.9137-2-kirill.shutemov%40linux.intel.com
2022-12-15x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macrosPasha Tatashin
Other architectures and the common mm/ use P*D_MASK, and P*D_SIZE. Remove the duplicated P*D_PAGE_MASK and P*D_PAGE_SIZE which are only used in x86/*. Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/r/20220516185202.604654-1-tatashin@google.com
2022-12-15x86/mm/pae: Get rid of set_64bit()Peter Zijlstra
Recognise that set_64bit() is a special case of our previously introduced pxx_xchg64(), so use that and get rid of set_64bit(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114425.233481884%40infradead.org
2022-12-15x86_64: Remove pointless set_64bit() usagePeter Zijlstra
The use of set_64bit() in X86_64 only code is pretty pointless, seeing how it's a direct assignment. Remove all this nonsense. [nathanchance: unbreak irte] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114425.168036718%40infradead.org
2022-12-15x86/mm/pae: Be consistent with pXXp_get_and_clear()Peter Zijlstra
Given that ptep_get_and_clear() uses cmpxchg8b, and that should be by far the most common case, there's no point in having an optimized variant for pmd/pud. Introduce the pxx_xchg64() helper to implement the common logic once. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114425.103392961%40infradead.org
2022-12-15x86/mm/pae: Use WRITE_ONCE()Peter Zijlstra
Disallow write-tearing, that would be really unfortunate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114425.038102604%40infradead.org
2022-12-15x86/mm/pae: Don't (ab)use atomic64Peter Zijlstra
PAE implies CX8, write readable code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114424.971450128%40infradead.org
2022-12-15mm: Rename GUP_GET_PTE_LOW_HIGHPeter Zijlstra
Since it no longer applies to only PTEs, rename it to PXX. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114424.776404066%40infradead.org
2022-12-15mm: Fix pmd_read_atomic()Peter Zijlstra
AFAICT there's no reason to do anything different than what we do for PTEs. Make it so (also affects SH). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114424.711181252%40infradead.org
2022-12-15x86/mm/pae: Make pmd_t similar to pte_tPeter Zijlstra
Instead of mucking about with at least 2 different ways of fudging it, do the same thing we do for pte_t. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221022114424.580310787%40infradead.org
2022-12-15x86/mm: Implement native set_memory_rox()Peter Zijlstra
Provide a native implementation of set_memory_rox(), avoiding the double set_memory_ro();set_memory_x(); calls. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2022-12-15mm: Introduce set_memory_rox()Peter Zijlstra
Because endlessly repeating: set_memory_ro() set_memory_x() is getting tedious. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/Y1jek64pXOsougmz@hirez.programming.kicks-ass.net
2022-12-15x86/mm: Do verify W^X at boot upPeter Zijlstra
Straight up revert of commit: a970174d7a10 ("x86/mm: Do not verify W^X at boot up") now that the root cause has been fixed. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221025201058.011279208@infradead.org
2022-12-15x86/ftrace: Remove SYSTEM_BOOTING exceptionsPeter Zijlstra
Now that text_poke is available before ftrace, remove the SYSTEM_BOOTING exceptions. Specifically, this cures a W+X case during boot. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221025201057.945960823@infradead.org
2022-12-15x86/mm: Use mm_alloc() in poking_init()Peter Zijlstra
Instead of duplicating init_mm, allocate a fresh mm. The advantage is that mm_alloc() has much simpler dependencies. Additionally it makes more conceptual sense, init_mm has no (and must not have) user state to duplicate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221025201057.816175235@infradead.org
2022-12-15x86/mm: Randomize per-cpu entry areaPeter Zijlstra
Seth found that the CPU-entry-area; the piece of per-cpu data that is mapped into the userspace page-tables for kPTI is not subject to any randomization -- irrespective of kASLR settings. On x86_64 a whole P4D (512 GB) of virtual address space is reserved for this structure, which is plenty large enough to randomize things a little. As such, use a straight forward randomization scheme that avoids duplicates to spread the existing CPUs over the available space. [ bp: Fix le build. ] Reported-by: Seth Jenkins <sethjenkins@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-12-15x86/kasan: Map shadow for percpu pages on demandAndrey Ryabinin
KASAN maps shadow for the entire CPU-entry-area: [CPU_ENTRY_AREA_BASE, CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE] This will explode once the per-cpu entry areas are randomized since it will increase CPU_ENTRY_AREA_MAP_SIZE to 512 GB and KASAN fails to allocate shadow for such big area. Fix this by allocating KASAN shadow only for really used cpu entry area addresses mapped by cea_map_percpu_pages() Thanks to the 0day folks for finding and reporting this to be an issue. [ dhansen: tweak changelog since this will get committed before peterz's actual cpu-entry-area randomization ] Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Yujie Liu <yujie.liu@intel.com> Cc: kernel test robot <yujie.liu@intel.com> Link: https://lore.kernel.org/r/202210241508.2e203c3d-yujie.liu@intel.com
2022-12-15Merge tag 'gpio-updates-for-v6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio updates from Bartosz Golaszewski: "We have a new GPIO multiplexer driver, bunch of driver updates and refactoring in the core GPIO library. GPIO core: - teach gpiolib to work with software nodes for HW description - remove ARCH_NR_GPIOS treewide as we no longer impose any limit on the number of GPIOS since the allocation became entirely dynamic - add support for HW quirks for Cirrus CS42L56 codec, Marvell NFC controller, Freescale PCIe and Ethernet controller, Himax LCDs and Mediatek mt2701 - refactor OF quirk code - some general refactoring of the OF and ACPI code, adding new helpers, minor tweaks and fixes, making fwnode usage consistent etc. GPIO uAPI: - fix an issue where the user-space can trigger a NULL-pointer dereference in the kernel by opening a device file, forcing a driver unbind and then calling one of the syscalls on the associated file descriptor New drivers: - add gpio-latch: a new GPIO multiplexer based on latches connected to other GPIOs Driver updates: - convert i2c GPIO expanders to using .probe_new() - drop the gpio-sta2x11 driver - factor out common code for the ACCES IDIO-16 family of controllers and use this new library wherever applicable in drivers - add DT support to gpio-hisi - allow building gpio-davinci as a module and increase its maxItems property - add support for a new model to gpio-pca9570 - other minor changes to various drivers" * tag 'gpio-updates-for-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: (66 commits) gpio: sim: set a limit on the number of GPIOs gpiolib: protect the GPIO device against being dropped while in use by user-space gpiolib: cdev: fix NULL-pointer dereferences gpiolib: Provide to_gpio_device() helper gpiolib: Unify access to the device properties gpio: Do not include <linux/kernel.h> when not really needed. gpio: pcf857x: Convert to i2c's .probe_new() gpio: pca953x: Convert to i2c's .probe_new() gpio: max732x: Convert to i2c's .probe_new() dt-bindings: gpio: gpio-davinci: Increase maxItems in gpio-line-names gpiolib: ensure that fwnode is properly set gpio: sl28cpld: Replace irqchip mask_invert with unmask_base gpiolib: of: Use correct fwnode for DT-probed chips gpiolib: of: Drop redundant check in of_mm_gpiochip_remove() gpiolib: of: Prepare of_mm_gpiochip_add_data() for fwnode gpiolib: add support for software nodes gpiolib: consolidate GPIO lookups gpiolib: acpi: avoid leaking ACPI details into upper gpiolib layers gpiolib: acpi: teach acpi_find_gpio() to handle data-only nodes gpiolib: acpi: change acpi_find_gpio() to accept firmware node ...
2022-12-14Merge tag 'x86_core_for_v6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 core updates from Borislav Petkov: - Add the call depth tracking mitigation for Retbleed which has been long in the making. It is a lighterweight software-only fix for Skylake-based cores where enabling IBRS is a big hammer and causes a significant performance impact. What it basically does is, it aligns all kernel functions to 16 bytes boundary and adds a 16-byte padding before the function, objtool collects all functions' locations and when the mitigation gets applied, it patches a call accounting thunk which is used to track the call depth of the stack at any time. When that call depth reaches a magical, microarchitecture-specific value for the Return Stack Buffer, the code stuffs that RSB and avoids its underflow which could otherwise lead to the Intel variant of Retbleed. This software-only solution brings a lot of the lost performance back, as benchmarks suggest: https://lore.kernel.org/all/20220915111039.092790446@infradead.org/ That page above also contains a lot more detailed explanation of the whole mechanism - Implement a new control flow integrity scheme called FineIBT which is based on the software kCFI implementation and uses hardware IBT support where present to annotate and track indirect branches using a hash to validate them - Other misc fixes and cleanups * tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits) x86/paravirt: Use common macro for creating simple asm paravirt functions x86/paravirt: Remove clobber bitmask from .parainstructions x86/debug: Include percpu.h in debugreg.h to get DECLARE_PER_CPU() et al x86/cpufeatures: Move X86_FEATURE_CALL_DEPTH from bit 18 to bit 19 of word 11, to leave space for WIP X86_FEATURE_SGX_EDECCSSA bit x86/Kconfig: Enable kernel IBT by default x86,pm: Force out-of-line memcpy() objtool: Fix weak hole vs prefix symbol objtool: Optimize elf_dirty_reloc_sym() x86/cfi: Add boot time hash randomization x86/cfi: Boot time selection of CFI scheme x86/ibt: Implement FineIBT objtool: Add --cfi to generate the .cfi_sites section x86: Add prefix symbols for function padding objtool: Add option to generate prefix symbols objtool: Avoid O(bloody terrible) behaviour -- an ode to libelf objtool: Slice up elf_create_section_symbol() kallsyms: Revert "Take callthunks into account" x86: Unconfuse CONFIG_ and X86_FEATURE_ namespaces x86/retpoline: Fix crash printing warning x86/paravirt: Fix a !PARAVIRT build warning ...
2022-12-14Merge tag 'v6.2-p1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Optimise away self-test overhead when they are disabled - Support symmetric encryption via keyring keys in af_alg - Flip hwrng default_quality, the default is now maximum entropy Algorithms: - Add library version of aesgcm - CFI fixes for assembly code - Add arm/arm64 accelerated versions of sm3/sm4 Drivers: - Remove assumption on arm64 that kmalloc is DMA-aligned - Fix selftest failures in rockchip - Add support for RK3328/RK3399 in rockchip - Add deflate support in qat - Merge ux500 into stm32 - Add support for TEE for PCI ID 0x14CA in ccp - Add mt7986 support in mtk - Add MaxLinear platform support in inside-secure - Add NPCM8XX support in npcm" * tag 'v6.2-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (184 commits) crypto: ux500/cryp - delete driver crypto: stm32/cryp - enable for use with Ux500 crypto: stm32 - enable drivers to be used on Ux500 dt-bindings: crypto: Let STM32 define Ux500 CRYP hwrng: geode - Fix PCI device refcount leak hwrng: amd - Fix PCI device refcount leak crypto: qce - Set DMA alignment explicitly crypto: octeontx2 - Set DMA alignment explicitly crypto: octeontx - Set DMA alignment explicitly crypto: keembay - Set DMA alignment explicitly crypto: safexcel - Set DMA alignment explicitly crypto: hisilicon/hpre - Set DMA alignment explicitly crypto: chelsio - Set DMA alignment explicitly crypto: ccree - Set DMA alignment explicitly crypto: ccp - Set DMA alignment explicitly crypto: cavium - Set DMA alignment explicitly crypto: img-hash - Fix variable dereferenced before check 'hdev->req' crypto: arm64/ghash-ce - use frame_push/pop macros consistently crypto: arm64/crct10dif - use frame_push/pop macros consistently crypto: arm64/aes-modes - use frame_push/pop macros consistently ...
2022-12-14Merge tag 'hardening-v6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kernel hardening updates from Kees Cook: - Convert flexible array members, fix -Wstringop-overflow warnings, and fix KCFI function type mismatches that went ignored by maintainers (Gustavo A. R. Silva, Nathan Chancellor, Kees Cook) - Remove the remaining side-effect users of ksize() by converting dma-buf, btrfs, and coredump to using kmalloc_size_roundup(), add more __alloc_size attributes, and introduce full testing of all allocator functions. Finally remove the ksize() side-effect so that each allocation-aware checker can finally behave without exceptions - Introduce oops_limit (default 10,000) and warn_limit (default off) to provide greater granularity of control for panic_on_oops and panic_on_warn (Jann Horn, Kees Cook) - Introduce overflows_type() and castable_to_type() helpers for cleaner overflow checking - Improve code generation for strscpy() and update str*() kern-doc - Convert strscpy and sigphash tests to KUnit, and expand memcpy tests - Always use a non-NULL argument for prepare_kernel_cred() - Disable structleak plugin in FORTIFY KUnit test (Anders Roxell) - Adjust orphan linker section checking to respect CONFIG_WERROR (Xin Li) - Make sure siginfo is cleared for forced SIGKILL (haifeng.xu) - Fix um vs FORTIFY warnings for always-NULL arguments * tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (31 commits) ksmbd: replace one-element arrays with flexible-array members hpet: Replace one-element array with flexible-array member um: virt-pci: Avoid GCC non-NULL warning signal: Initialize the info in ksignal lib: fortify_kunit: build without structleak plugin panic: Expose "warn_count" to sysfs panic: Introduce warn_limit panic: Consolidate open-coded panic_on_warn checks exit: Allow oops_limit to be disabled exit: Expose "oops_count" to sysfs exit: Put an upper limit on how often we can oops panic: Separate sysctl logic from CONFIG_SMP mm/pgtable: Fix multiple -Wstringop-overflow warnings mm: Make ksize() a reporting-only function kunit/fortify: Validate __alloc_size attribute results drm/sti: Fix return type of sti_{dvo,hda,hdmi}_connector_mode_valid() drm/fsl-dcu: Fix return type of fsl_dcu_drm_connector_mode_valid() driver core: Add __alloc_size hint to devm allocators overflow: Introduce overflows_type() and castable_to_type() coredump: Proactively round up to kmalloc bucket size ...
2022-12-14Merge tag 'pci-v6.2-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "Enumeration: - Squash portdrv_{core,pci}.c into portdrv.c to ease maintenance and make more things static. - Make portdrv bind to Switch Ports that have AER. Previously, if these Ports lacked MSI/MSI-X, portdrv failed to bind, which meant the Ports couldn't be suspended to low-power states. AER on these Ports doesn't use interrupts, and the AER driver doesn't need to claim them. - Assign PCI domain IDs using ida_alloc(), which makes host bridge add/remove work better. Resource management: - To work better with recent BIOSes that use EfiMemoryMappedIO for PCI host bridge apertures, remove those regions from the E820 map (E820 entries normally prevent us from allocating BARs). In v5.19, we added some quirks to disable E820 checking, but that's not very maintainable. EfiMemoryMappedIO means the OS needs to map the region for use by EFI runtime services; it shouldn't prevent OS from using it. PCIe native device hotplug: - Build pciehp by default if USB4 is enabled, since Thunderbolt/USB4 PCIe tunneling depends on native PCIe hotplug. - Enable Command Completed Interrupt only if supported to avoid user confusion from lspci output that says this is enabled but not supported. - Prevent pciehp from binding to Switch Upstream Ports; this happened because of interaction with acpiphp and caused devices below the Upstream Port to disappear. Power management: - Convert AGP drivers to generic power management. We hope to remove legacy power management from the PCI core eventually. Virtualization: - Fix pci_device_is_present(), which previously always returned "false" for VFs, causing virtio hangs when unbinding the driver. Miscellaneous: - Convert drivers to gpiod API to prepare for dropping some legacy code. - Fix DOE fencepost error for the maximum data object length. Baikal-T1 PCIe controller driver: - Add driver and DT bindings. Broadcom STB PCIe controller driver: - Enable Multi-MSI. - Delay 100ms after PERST# deassert to allow power and clocks to stabilize. - Configure Read Completion Boundary to 64 bytes. Freescale i.MX6 PCIe controller driver: - Initialize PHY before deasserting core reset to fix a regression in v6.0 on boards where the PHY provides the reference. - Fix imx6sx and imx8mq clock names in DT schema. Intel VMD host bridge driver: - Fix Secondary Bus Reset on VMD bridges, which allows reset of NVMe SSDs in VT-d pass-through scenarios. - Disable MSI remapping, which gets re-enabled by firmware during suspend/resume. MediaTek PCIe Gen3 controller driver: - Add MT7986 and MT8195 support. Qualcomm PCIe controller driver: - Add SC8280XP/SA8540P basic interconnect support. Rockchip DesignWare PCIe controller driver: - Base DT schema on common Synopsys schema. Synopsys DesignWare PCIe core: - Collect DT items shared between Root Port and Endpoint (PERST GPIO, PHY info, clocks, resets, link speed, number of lanes, number of iATU windows, interrupt info, etc) to snps,dw-pcie-common.yaml. - Add dma-ranges support for Root Ports and Endpoints. - Consolidate DT resource retrieval for "dbi", "dbi2", "atu", etc. to reduce code duplication. - Add generic names for clocks and resets to encourage more consistent naming across drivers using DesignWare IP. - Stop advertising PTM Responder role for Endpoints, which aren't allowed to be responders. TI J721E PCIe driver: - Add j721s2 host mode ID to DT schema. - Add interrupt properties to DT schema. Toshiba Visconti PCIe controller driver: - Fix interrupts array max constraints in DT schema" * tag 'pci-v6.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (95 commits) x86/PCI: Use pr_info() when possible x86/PCI: Fix log message typo x86/PCI: Tidy E820 removal messages PCI: Skip allocate_resource() if too little space available efi/x86: Remove EfiMemoryMappedIO from E820 map PCI/portdrv: Allow AER service only for Root Ports & RCECs PCI: xilinx-nwl: Fix coding style violations PCI: mvebu: Switch to using gpiod API PCI: pciehp: Enable Command Completed Interrupt only if supported PCI: aardvark: Switch to using devm_gpiod_get_optional() dt-bindings: PCI: mediatek-gen3: add support for mt7986 dt-bindings: PCI: mediatek-gen3: add SoC based clock config dt-bindings: PCI: qcom: Allow 'dma-coherent' property PCI: mt7621: Add sentinel to quirks table PCI: vmd: Fix secondary bus reset for Intel bridges PCI: endpoint: pci-epf-vntb: Fix sparse ntb->reg build warning PCI: endpoint: pci-epf-vntb: Fix sparse build warning for epf_db PCI: endpoint: pci-epf-vntb: Replace hardcoded 4 with sizeof(u32) PCI: endpoint: pci-epf-vntb: Remove unused epf_db_phy struct member PCI: endpoint: pci-epf-vntb: Fix call pci_epc_mem_free_addr() in error path ...