summaryrefslogtreecommitdiff
path: root/arch/x86/mm
AgeCommit message (Collapse)Author
2024-03-04x86/sme: Move early SME kernel encryption handling into .head.textArd Biesheuvel
The .head.text section is the initial primary entrypoint of the core kernel, and is entered with the CPU executing from a 1:1 mapping of memory. Such code must never access global variables using absolute references, as these are based on the kernel virtual mapping which is not active yet at this point. Given that the SME startup code is also called from this early execution context, move it into .head.text as well. This will allow more thorough build time checks in the future to ensure that early startup code only uses RIP-relative references to global variables. Also replace some occurrences of __pa_symbol() [which relies on the compiler generating an absolute reference, which is not guaranteed] and an open coded RIP-relative access with RIP_REL_REF(). Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240227151907.387873-18-ardb+git@google.com
2024-03-04x86/boot: Move mem_encrypt= parsing to the decompressorArd Biesheuvel
The early SME/SEV code parses the command line very early, in order to decide whether or not memory encryption should be enabled, which needs to occur even before the initial page tables are created. This is problematic for a number of reasons: - this early code runs from the 1:1 mapping provided by the decompressor or firmware, which uses a different translation than the one assumed by the linker, and so the code needs to be built in a special way; - parsing external input while the entire kernel image is still mapped writable is a bad idea in general, and really does not belong in security minded code; - the current code ignores the built-in command line entirely (although this appears to be the case for the entire decompressor) Given that the decompressor/EFI stub is an intrinsic part of the x86 bootable kernel image, move the command line parsing there and out of the core kernel. This removes the need to build lib/cmdline.o in a special way, or to use RIP-relative LEA instructions in inline asm blocks. This involves a new xloadflag in the setup header to indicate that mem_encrypt=on appeared on the kernel command line. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240227151907.387873-17-ardb+git@google.com
2024-03-01x86/mm: Regularize set_memory_p() parameters and make non-staticMichael Kelley
set_memory_p() is currently static. It has parameters that don't match set_memory_p() under arch/powerpc and that aren't congruent with the other set_memory_* functions. There's no good reason for the difference. Fix this by making the parameters consistent, and update the one existing call site. Make the function non-static and add it to include/asm/set_memory.h so that it is completely parallel to set_memory_np() and is usable in other modules. No functional change. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/20240116022008.1023398-3-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240116022008.1023398-3-mhklinux@outlook.com>
2024-03-01x86/hyperv: Use slow_virt_to_phys() in page transition hypervisor callbackMichael Kelley
In preparation for temporarily marking pages not present during a transition between encrypted and decrypted, use slow_virt_to_phys() in the hypervisor callback. As long as the PFN is correct, slow_virt_to_phys() works even if the leaf PTE is not present. The existing functions that depend on vmalloc_to_page() all require that the leaf PTE be marked present, so they don't work. Update the comments for slow_virt_to_phys() to note this broader usage and the requirement to work even if the PTE is not marked present. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://lore.kernel.org/r/20240116022008.1023398-2-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240116022008.1023398-2-mhklinux@outlook.com>
2024-02-28x86/sev: Dump SEV_STATUSBorislav Petkov (AMD)
It is, and will be even more useful in the future, to dump the SEV features enabled according to SEV_STATUS. Do so: [ 0.542753] Memory Encryption Features active: AMD SEV SEV-ES SEV-SNP [ 0.544425] SEV: Status: SEV SEV-ES SEV-SNP DebugSwap Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikunj A Dadhania <nikunj@amd.com> Link: https://lore.kernel.org/r/20240219094216.GAZdMieDHKiI8aaP3n@fat_crate.local
2024-02-27Merge branch 'x86/urgent' into x86/apic, to resolve conflictsIngo Molnar
Conflicts: arch/x86/kernel/cpu/common.c arch/x86/kernel/cpu/intel.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-26Merge branch 'x86/sev' into x86/boot, to resolve conflicts and to pick up ↵Ingo Molnar
dependent tree We are going to queue up a number of patches that depend on fresh changes in x86/sev - merge in that branch to reduce the number of conflicts going forward. Also resolve a current conflict with x86/sev. Conflicts: arch/x86/include/asm/coco.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-24Merge tag 'cxl-fixes-6.8-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull cxl fixes from Dan Williams: "A collection of significant fixes for the CXL subsystem. The largest change in this set, that bordered on "new development", is the fix for the fact that the location of the new qos_class attribute did not match the Documentation. The fix ends up deleting more code than it added, and it has a new unit test to backstop basic errors in this interface going forward. So the "red-diff" and unit test saved the "rip it out and try again" response. In contrast, the new notification path for firmware reported CXL errors (CXL CPER notifications) has a locking context bug that can not be fixed with a red-diff. Given where the release cycle stands, it is not comfortable to squeeze in that fix in these waning days. So, that receives the "back it out and try again later" treatment. There is a regression fix in the code that establishes memory NUMA nodes for platform CXL regions. That has an ack from x86 folks. There are a couple more fixups for Linux to understand (reassemble) CXL regions instantiated by platform firmware. The policy around platforms that do not match host-physical-address with system-physical-address (i.e. systems that have an address translation mechanism between the address range reported in the ACPI CEDT.CFMWS and endpoint decoders) has been softened to abort driver load rather than teardown the memory range (can cause system hangs). Lastly, there is a robustness / regression fix for cases where the driver would previously continue in the face of error, and a fixup for PCI error notification handling. Summary: - Fix NUMA initialization from ACPI CEDT.CFMWS - Fix region assembly failures due to async init order - Fix / simplify export of qos_class information - Fix cxl_acpi initialization vs single-window-init failures - Fix handling of repeated 'pci_channel_io_frozen' notifications - Workaround platforms that violate host-physical-address == system-physical address assumptions - Defer CXL CPER notification handling to v6.9" * tag 'cxl-fixes-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: cxl/acpi: Fix load failures due to single window creation failure acpi/ghes: Remove CXL CPER notifications cxl/pci: Fix disabling memory if DVSEC CXL Range does not match a CFMWS window cxl/test: Add support for qos_class checking cxl: Fix sysfs export of qos_class for memdev cxl: Remove unnecessary type cast in cxl_qos_class_verify() cxl: Change 'struct cxl_memdev_state' *_perf_list to single 'struct cxl_dpa_perf' cxl/region: Allow out of order assembly of autodiscovered regions cxl/region: Handle endpoint decoders in cxl_region_find_decoder() x86/numa: Fix the sort compare func used in numa_fill_memblks() x86/numa: Fix the address overlap check in numa_fill_memblks() cxl/pci: Skip to handle RAS errors if CXL.mem device is detached
2024-02-22x86/mm/cpa: Warn for set_memory_XXcrypted() VMM failsRick Edgecombe
On TDX it is possible for the untrusted host to cause set_memory_encrypted() or set_memory_decrypted() to fail such that an error is returned and the resulting memory is shared. Callers need to take care to handle these errors to avoid returning decrypted (shared) memory to the page allocator, which could lead to functional or security issues. In terms of security, the problematic case is guest PTEs mapping the shared alias GFNs, since the VMM has control of the shared mapping in the EPT/NPT. Such conversion errors may herald future system instability, but are temporarily survivable with proper handling in the caller. The kernel traditionally makes every effort to keep running, but it is expected that some coco guests may prefer to play it safe security-wise, and panic in this case. To accommodate both cases, warn when the arch breakouts for converting memory at the VMM layer return an error to CPA. Security focused users can rely on panic_on_warn to defend against bugs in the callers. Some VMMs are not known to behave in the troublesome way, so users that would like to terminate on any unusual behavior by the VMM around this will be covered as well. Since the arch breakouts host the logic for handling coco implementation specific errors, an error returned from them means that the set_memory() call is out of options for handling the error internally. Make this the condition to warn about. It is possible that very rarely these functions could fail due to guest memory pressure (in the case of failing to allocate a huge page when splitting a page table). Don't warn in this case because it is a lot less likely to indicate an attack by the host and it is not clear which set_memory() calls should get the same treatment. That corner should be addressed by future work that considers the more general problem and not just papers over a single set_memory() variant. Suggested-by: Michael Kelley (LINUX) <mikelley@microsoft.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/all/20240122184003.129104-1-rick.p.edgecombe%40intel.com
2024-02-22mm: ptdump: have ptdump_check_wx() return boolChristophe Leroy
Have ptdump_check_wx() return true when the check is successful or false otherwise. [akpm@linux-foundation.org: fix a couple of build issues (x86_64 allmodconfig)] Link: https://lkml.kernel.org/r/7943149fe955458cb7b57cd483bf41a3aad94684.1706610398.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg KH <greg@kroah.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Phong Tran <tranmanphong@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Price <steven.price@arm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22arm64, powerpc, riscv, s390, x86: ptdump: refactor CONFIG_DEBUG_WXChristophe Leroy
All architectures using the core ptdump functionality also implement CONFIG_DEBUG_WX, and they all do it more or less the same way, with a function called debug_checkwx() that is called by mark_rodata_ro(), which is a substitute to ptdump_check_wx() when CONFIG_DEBUG_WX is set and a no-op otherwise. Refactor by centrally defining debug_checkwx() in linux/ptdump.h and call debug_checkwx() immediately after calling mark_rodata_ro() instead of calling it at the end of every mark_rodata_ro(). On x86_32, mark_rodata_ro() first checks __supported_pte_mask has _PAGE_NX before calling debug_checkwx(). Now the check is inside the callee ptdump_walk_pgd_level_checkwx(). On powerpc_64, mark_rodata_ro() bails out early before calling ptdump_check_wx() when the MMU doesn't have KERNEL_RO feature. The check is now also done in ptdump_check_wx() as it is called outside mark_rodata_ro(). Link: https://lkml.kernel.org/r/a59b102d7964261d31ead0316a9f18628e4e7a8e.1706610398.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg KH <greg@kroah.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Phong Tran <tranmanphong@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Price <steven.price@arm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22x86/mm: clarify "prev" usage in switch_mm_irqs_off()Yosry Ahmed
In the x86 implementation of switch_mm_irqs_off(), we do not use the "prev" argument passed in by the caller, we use exclusively use "real_prev", which is cpu_tlbstate.loaded_mm. This is not obvious at the first sight. Furthermore, a comment describes a condition that happens when called with prev == next, but this should not affect the function in any way since prev is unused. Apparently, the comment is intended to clarify why we don't rely on prev == next to decide whether we need to update CR3, but again, it is not obvious. The comment also references the fact that leave_mm() calls with prev == NULL and tsk == NULL, but this also shouldn't matter because prev is unused and tsk is only used in one function which has a NULL check. Clarify things by renaming (prev -> unused) and (real_prev -> prev), also move and rewrite the comment as an explanation for why we don't rely on "prev" supplied by the caller in x86 code and use our own. Hopefully this makes reading the code easier. Link: https://lkml.kernel.org/r/20240126080644.1714297-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22x86/mm: delete unused cpu argument to leave_mm()Yosry Ahmed
The argument is unused since commit 3d28ebceaffa ("x86/mm: Rework lazy TLB to track the actual loaded mm"), delete it. Link: https://lkml.kernel.org/r/20240126080644.1714297-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22Merge tag 'net-6.8.0-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf and netfilter. Current release - regressions: - af_unix: fix another unix GC hangup Previous releases - regressions: - core: fix a possible AF_UNIX deadlock - bpf: fix NULL pointer dereference in sk_psock_verdict_data_ready() - netfilter: nft_flow_offload: release dst in case direct xmit path is used - bridge: switchdev: ensure MDB events are delivered exactly once - l2tp: pass correct message length to ip6_append_data - dccp/tcp: unhash sk from ehash for tb2 alloc failure after check_estalblished() - tls: fixes for record type handling with PEEK - devlink: fix possible use-after-free and memory leaks in devlink_init() Previous releases - always broken: - bpf: fix an oops when attempting to read the vsyscall page through bpf_probe_read_kernel - sched: act_mirred: use the backlog for mirred ingress - netfilter: nft_flow_offload: fix dst refcount underflow - ipv6: sr: fix possible use-after-free and null-ptr-deref - mptcp: fix several data races - phonet: take correct lock to peek at the RX queue Misc: - handful of fixes and reliability improvements for selftests" * tag 'net-6.8.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) l2tp: pass correct message length to ip6_append_data net: phy: realtek: Fix rtl8211f_config_init() for RTL8211F(D)(I)-VD-CG PHY selftests: ioam: refactoring to align with the fix Fix write to cloned skb in ipv6_hop_ioam() phonet/pep: fix racy skb_queue_empty() use phonet: take correct lock to peek at the RX queue net: sparx5: Add spinlock for frame transmission from CPU net/sched: flower: Add lock protection when remove filter handle devlink: fix port dump cmd type net: stmmac: Fix EST offset for dwmac 5.10 tools: ynl: don't leak mcast_groups on init error tools: ynl: make sure we always pass yarg to mnl_cb_run net: mctp: put sock on tag allocation failure netfilter: nf_tables: use kzalloc for hook allocation netfilter: nf_tables: register hooks last when adding new chain/flowtable netfilter: nft_flow_offload: release dst in case direct xmit path is used netfilter: nft_flow_offload: reset dst in route object after setting up flow netfilter: nf_tables: set dormant flag on hook register failure selftests: tls: add test for peeking past a record of a different type selftests: tls: add test for merging of same-type control messages ...
2024-02-20Merge branch 'for-6.8/cxl-cper' into for-6.8/cxlDan Williams
Pick up CXL CPER notification removal for v6.8-rc6, to return in a later merge window.
2024-02-20x86/pat: Simplify the PAT programming protocolKirill A. Shutemov
The programming protocol for the PAT MSR follows the MTRR programming protocol. However, this protocol is cumbersome and requires disabling caching (CR0.CD=1), which is not possible on some platforms. Specifically, a TDX guest is not allowed to set CR0.CD. It triggers a #VE exception. It turns out that the requirement to follow the MTRR programming protocol for PAT programming is unnecessarily strict. The new Intel Software Developer Manual (http://www.intel.com/sdm) (December 2023) relaxes this requirement, please refer to the section titled "Programming the PAT" for more information. In short, this section provides an alternative PAT update sequence which doesn't need to disable caches around the PAT update but only to flush those caches and TLBs. The AMD documentation does not link PAT programming to MTRR and is there fore, fine too. The kernel only needs to flush the TLB after updating the PAT MSR. The set_memory code already takes care of flushing the TLB and cache when changing the memory type of a page. [ bp: Expand commit message. ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20240124130650.496056-1-kirill.shutemov@linux.intel.com
2024-02-16x86/numa: Fix the sort compare func used in numa_fill_memblks()Alison Schofield
The compare function used to sort memblks into starting address order fails when the result of its u64 address subtraction gets truncated to an int upon return. The impact of the bad sort is that memblks will be filled out incorrectly. Depending on the set of memblks, a user may see no errors at all but still have a bad fill, or see messages reporting a node overlap that leads to numa init failure: [] node 0 [mem: ] overlaps with node 1 [mem: ] [] No NUMA configuration found Replace with a comparison that can only result in: 1, 0, -1. Fixes: 8f012db27c95 ("x86/numa: Introduce numa_fill_memblks()") Signed-off-by: Alison Schofield <alison.schofield@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/99dcb3ae87e04995e9f293f6158dc8fa0749a487.1705085543.git.alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2024-02-16x86/numa: Fix the address overlap check in numa_fill_memblks()Alison Schofield
numa_fill_memblks() fills in the gaps in numa_meminfo memblks over a physical address range. To do so, it first creates a list of existing memblks that overlap that address range. The issue is that it is off by one when comparing to the end of the address range, so memblks that do not overlap are selected. The impact of selecting a memblk that does not actually overlap is that an existing memblk may be filled when the expected action is to do nothing and return NUMA_NO_MEMBLK to the caller. The caller can then add a new NUMA node and memblk. Replace the broken open-coded search for address overlap with the memblock helper memblock_addrs_overlap(). Update the kernel doc and in code comments. Suggested by: "Huang, Ying" <ying.huang@intel.com> Fixes: 8f012db27c95 ("x86/numa: Introduce numa_fill_memblks()") Signed-off-by: Alison Schofield <alison.schofield@intel.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/10a3e6109c34c21a8dd4c513cf63df63481a2b07.1705085543.git.alison.schofield@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2024-02-15x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()Hou Tao
When trying to use copy_from_kernel_nofault() to read vsyscall page through a bpf program, the following oops was reported: BUG: unable to handle page fault for address: ffffffffff600000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ #58 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...... RIP: 0010:copy_from_kernel_nofault+0x6f/0x110 ...... Call Trace: <TASK> ? copy_from_kernel_nofault+0x6f/0x110 bpf_probe_read_kernel+0x1d/0x50 bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d trace_call_bpf+0xc5/0x1c0 perf_call_bpf_enter.isra.0+0x69/0xb0 perf_syscall_enter+0x13e/0x200 syscall_trace_enter+0x188/0x1c0 do_syscall_64+0xb5/0xe0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 </TASK> ...... ---[ end trace 0000000000000000 ]--- The oops is triggered when: 1) A bpf program uses bpf_probe_read_kernel() to read from the vsyscall page and invokes copy_from_kernel_nofault() which in turn calls __get_user_asm(). 2) Because the vsyscall page address is not readable from kernel space, a page fault exception is triggered accordingly. 3) handle_page_fault() considers the vsyscall page address as a user space address instead of a kernel space address. This results in the fix-up setup by bpf not being applied and a page_fault_oops() is invoked due to SMAP. Considering handle_page_fault() has already considered the vsyscall page address as a userspace address, fix the problem by disallowing vsyscall page read for copy_from_kernel_nofault(). Originally-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: syzbot+72aa0161922eba61b50e@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com Reported-by: xingwei lee <xrivendell7@gmail.com> Closes: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240202103935.3154011-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.hHou Tao
Move is_vsyscall_vaddr() into asm/vsyscall.h to make it available for copy_from_kernel_nofault_allowed() in arch/x86/mm/maccess.c. Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20240202103935.3154011-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15x86/mm/numa: Move early mptable evaluation into common codeThomas Gleixner
There is no reason to have the early mptable evaluation conditionally invoked only from the AMD numa topology code. Make it explicit and invoke it from setup_arch() right after the corresponding ACPI init call. Remove the pointless wrapper and invoke x86_init::mpparse::early_parse_smp_config() directly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michael Kelley <mhklinux@outlook.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Link: https://lore.kernel.org/r/20240212154639.931761608@linutronix.de
2024-02-15x86/mm/numa: Use core domain size on AMDThomas Gleixner
cpuinfo::topo::x86_coreid_bits is about to be phased out. Use the core domain size from the topology information. Add a comment why the early MPTABLE parsing is required and decrapify the loop which sets the APIC ID to node map. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Wang Wendy <wendy.wang@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20240212153625.270320718@linutronix.de
2024-02-12x86/mm/ident_map: Use gbpages only where full GB page should be mapped.Steve Wahl
When ident_pud_init() uses only gbpages to create identity maps, large ranges of addresses not actually requested can be included in the resulting table; a 4K request will map a full GB. On UV systems, this ends up including regions that will cause hardware to halt the system if accessed (these are marked "reserved" by BIOS). Even processor speculation into these regions is enough to trigger the system halt. Only use gbpages when map creation requests include the full GB page of space. Fall back to using smaller 2M pages when only portions of a GB page are included in the request. No attempt is made to coalesce mapping requests. If a request requires a map entry at the 2M (pmd) level, subsequent mapping requests within the same 1G region will also be at the pmd level, even if adjacent or overlapping such requests could have been combined to map a full gbpage. Existing usage starts with larger regions and then adds smaller regions, so this should not have any great consequence. [ dhansen: fix up comment formatting, simplifty changelog ] Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240126164841.170866-1-steve.wahl%40hpe.com
2024-02-06x86/sev: Fix position dependent variable references in startup codeArd Biesheuvel
The early startup code executes from a 1:1 mapping of memory, which differs from the mapping that the code was linked and/or relocated to run at. The latter mapping is not active yet at this point, and so symbol references that rely on it will fault. Given that the core kernel is built without -fPIC, symbol references are typically emitted as absolute, and so any such references occuring in the early startup code will therefore crash the kernel. While an attempt was made to work around this for the early SEV/SME startup code, by forcing RIP-relative addressing for certain global SEV/SME variables via inline assembly (see snp_cpuid_get_table() for example), RIP-relative addressing must be pervasively enforced for SEV/SME global variables when accessed prior to page table fixups. __startup_64() already handles this issue for select non-SEV/SME global variables using fixup_pointer(), which adjusts the pointer relative to a `physaddr` argument. To avoid having to pass around this `physaddr` argument across all functions needing to apply pointer fixups, introduce a macro RIP_RELATIVE_REF() which generates a RIP-relative reference to a given global variable. It is used where necessary to force RIP-relative accesses to global variables. For backporting purposes, this patch makes no attempt at cleaning up other occurrences of this pattern, involving either inline asm or fixup_pointer(). Those will be addressed later. [ bp: Call it "rip_rel_ref" everywhere like other code shortens "rIP-relative reference" and make the asm wrapper __always_inline. ] Co-developed-by: Kevin Loughlin <kevinloughlin@google.com> Signed-off-by: Kevin Loughlin <kevinloughlin@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/all/20240130220845.1978329-1-kevinloughlin@google.com
2024-02-03x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULTBorislav Petkov (AMD)
It was meant well at the time but nothing's using it so get rid of it. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240202163510.GDZb0Zvj8qOndvFOiZ@fat_crate.local
2024-01-31x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_userXin Li
If the stack frame contains an invalid user context (e.g. due to invalid SS, a non-canonical RIP, etc.) the ERETU instruction will trap (#SS or #GP). From a Linux point of view, this really should be considered a user space failure, so use the standard fault fixup mechanism to intercept the fault, fix up the exception frame, and redirect execution to fred_entrypoint_user. The end result is that it appears just as if the hardware had taken the exception immediately after completing the transition to user space. Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Shan Kang <shan.kang@intel.com> Link: https://lore.kernel.org/r/20231205105030.8698-30-xin3.li@intel.com
2024-01-31x86/fred: Make exc_page_fault() work for FREDH. Peter Anvin (Intel)
On a FRED system, the faulting address (CR2) is passed on the stack, to avoid the problem of transient state. Thus the page fault address is read from the FRED stack frame instead of CR2 when FRED is enabled. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Shan Kang <shan.kang@intel.com> Link: https://lore.kernel.org/r/20231205105030.8698-22-xin3.li@intel.com
2024-01-29x86/fault: Dump RMP table information when RMP page faults occurMichael Roth
RMP faults on kernel addresses are fatal and should never happen in practice. They indicate a bug in the host kernel somewhere. Userspace RMP faults shouldn't occur either, since even for VMs the memory used for private pages is handled by guest_memfd and by design is not mappable by userspace. Dump RMP table information about the PFN corresponding to the faulting HVA to help diagnose any issues of this sort when show_fault_oops() is triggered by an RMP fault. Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240126041126.1927228-10-michael.roth@amd.com
2024-01-29x86/traps: Define RMP violation #PF error codeBrijesh Singh
Bit 31 in the page fault-error bit will be set when processor encounters an RMP violation. While at it, use the BIT() macro. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Dave Hansen <dave.hansen@intel.com> Link: https://lore.kernel.org/r/20240126041126.1927228-9-michael.roth@amd.com
2024-01-29x86/sme: Fix memory encryption setting if enabled by default and not overriddenArd Biesheuvel
Commit cbebd68f59f0 ("x86/mm: Fix use of uninitialized buffer in sme_enable()") 'fixed' an issue in sme_enable() detected by static analysis, and broke the common case in the process. cmdline_find_option() will return < 0 on an error, or when the command line argument does not appear at all. In this particular case, the latter is not an error condition, and so the early exit is wrong. Instead, without mem_encrypt= on the command line, the compile time default should be honoured, which could be to enable memory encryption, and this is currently broken. Fix it by setting sme_me_mask to a preliminary value based on the compile time default, and only omitting the command line argument test when cmdline_find_option() returns an error. [ bp: Drop active_by_default while at it. ] Fixes: cbebd68f59f0 ("x86/mm: Fix use of uninitialized buffer in sme_enable()") Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240126163918.2908990-2-ardb+git@google.com
2024-01-29x86/mm: Fix memory encryption features advertisementKirill A. Shutemov
When memory encryption is enabled, the kernel prints the encryption flavor that the system supports. The check assumes that everything is AMD SME/SEV if it doesn't have the TDX CPU feature set. Hyper-V vTOM sets cc_vendor to CC_VENDOR_INTEL when it runs as L2 guest on top of TDX, but not X86_FEATURE_TDX_GUEST. Hyper-V only needs memory encryption enabled for I/O without the rest of CoCo enabling. To avoid confusion, check the cc_vendor directly. [ bp: Massage commit message. ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240124140217.533748-1-kirill.shutemov@linux.intel.com
2024-01-26x86/mm: Get rid of conditional IF flag handling in page fault pathLinus Torvalds
We had this nonsensical code that would happily handle kernel page faults with interrupts disabled, which makes no sense at all. It turns out that this is legacy code that _used_ to make sense, back when we enabled IRQs as early as possible, and we used to have this code sequence essentially immediately after reading the faulting address from the %cr2 register. Back then, we could have kernel page faults to populate the vmalloc area with interrupts disabled, and they would need to stay disabled for that case. However, the code in question has been moved down in the page fault handling, and is now in the "handle faults in user addresses" section, and apparently nobody ever noticed that it no longer makes sense to handle these page faults with interrupts conditionally disabled. So replace the conditional IRQ enable: if (regs->flags & X86_EFLAGS_IF) local_irq_enable(); with an unconditional one, and add a temporary WARN_ON_ONCE() if some codepath actually does do page faults with interrupts disabled (without also doing a pagefault_disable(), of course). NOTE! We used to allow user space to disable interrupts with iopl(3). That is no longer true since commits: a24ca9976843 ("x86/iopl: Remove legacy IOPL option") b968e84b509d ("x86/iopl: Fake iopl(3) CLI/STI usage") so the WARN_ON_ONCE() is valid for both the kernel and user situation. For some of the history relevant to this code, see particularly commit 8c914cb704a1 ("x86_64: actively synchronize vmalloc area when registering certain callbacks"), which moved this below the vmalloc fault handling. Now that the user_mode() check is irrelevant, we can also move the FAULT_FLAG_USER flag setting down to where the other flag settings are done. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Uros Bizjak <ubizjak@gmail.com> Cc: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20240125173457.1281880-1-torvalds@linux-foundation.org
2024-01-10x86/bugs: Rename CONFIG_PAGE_TABLE_ISOLATION => ↵Breno Leitao
CONFIG_MITIGATION_PAGE_TABLE_ISOLATION Step 4/10 of the namespace unification of CPU mitigations related Kconfig options. [ mingo: Converted new uses that got added since the series was posted. ] Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231121160740.1249350-5-leitao@debian.org
2024-01-09Merge tag 'mm-stable-2024-01-08-15-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ...
2024-01-08Merge tag 'x86-cleanups-2024-01-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: - Change global variables to local - Add missing kernel-doc function parameter descriptions - Remove unused parameter from a macro - Remove obsolete Kconfig entry - Fix comments - Fix typos, mostly scripted, manually reviewed and a micro-optimization got misplaced as a cleanup: - Micro-optimize the asm code in secondary_startup_64_no_verify() * tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: arch/x86: Fix typos x86/head_64: Use TESTB instead of TESTL in secondary_startup_64_no_verify() x86/docs: Remove reference to syscall trampoline in PTI x86/Kconfig: Remove obsolete config X86_32_SMP x86/io: Remove the unused 'bw' parameter from the BUILDIO() macro x86/mtrr: Document missing function parameters in kernel-doc x86/setup: Make relocated_ramdisk a local variable of relocate_initrd()
2024-01-05Merge tag 'mm-hotfixes-stable-2024-01-05-11-35' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc mm fixes from Andrew Morton: "12 hotfixes. Two are cc:stable and the remainder either address post-6.7 issues or aren't considered necessary for earlier kernel versions" * tag 'mm-hotfixes-stable-2024-01-05-11-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: shrinker: use kvzalloc_node() from expand_one_shrinker_info() mailmap: add entries for Mathieu Othacehe MAINTAINERS: change vmware.com addresses to broadcom.com arch/mm/fault: fix major fault accounting when retrying under per-VMA lock mm/mglru: skip special VMAs in lru_gen_look_around() MAINTAINERS: hand over hwpoison maintainership to Miaohe Lin MAINTAINERS: remove hugetlb maintainer Mike Kravetz mm: fix unmap_mapping_range high bits shift bug mm: memcg: fix split queue list crash when large folio migration mm: fix arithmetic for max_prop_frac when setting max_ratio mm: fix arithmetic for bdi min_ratio mm: align larger anonymous mappings on THP boundaries
2024-01-03arch/x86: Fix typosBjorn Helgaas
Fix typos, most reported by "codespell arch/x86". Only touches comments, no code changes. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
2023-12-29arch/mm/fault: fix major fault accounting when retrying under per-VMA lockSuren Baghdasaryan
A test [1] in Android test suite started failing after [2] was merged. It turns out that after handling a major fault under per-VMA lock, the process major fault counter does not register that fault as major. Before [2] read faults would be done under mmap_lock, in which case FAULT_FLAG_TRIED flag is set before retrying. That in turn causes mm_account_fault() to account the fault as major once retry completes. With per-VMA locks we often retry because a fault can't be handled without locking the whole mm using mmap_lock. Therefore such retries do not set FAULT_FLAG_TRIED flag. This logic does not work after [2] because we can now handle read major faults under per-VMA lock and upon retry the fact there was a major fault gets lost. Fix this by setting FAULT_FLAG_TRIED after retrying under per-VMA lock if VM_FAULT_MAJOR was returned. Ideally we would use an additional VM_FAULT bit to indicate the reason for the retry (could not handle under per-VMA lock vs other reason) but this simpler solution seems to work, so keeping it simple. [1] https://cs.android.com/android/platform/superproject/+/master:test/vts-testcase/kernel/api/drop_caches_prop/drop_caches_test.cpp [2] https://lore.kernel.org/all/20231006195318.4087158-6-willy@infradead.org/ Link: https://lkml.kernel.org/r/20231226214610.109282-1-surenb@google.com Fixes: 12214eba1992 ("mm: handle read faults under the VMA lock") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10NUMA: optimize detection of memory with no node id assigned by firmwareLiam Ni
Sanity check that makes sure the nodes cover all memory loops over numa_meminfo to count the pages that have node id assigned by the firmware, then loops again over memblock.memory to find the total amount of memory and in the end checks that the difference between the total memory and memory that covered by nodes is less than some threshold. Worse, the loop over numa_meminfo calls __absent_pages_in_range() that also partially traverses memblock.memory. It's much simpler and more efficient to have a single traversal of memblock.memory that verifies that amount of memory not covered by nodes is less than a threshold. Introduce memblock_validate_numa_coverage() that does exactly that and use it instead of numa_meminfo_cover_memory(). Link: https://lkml.kernel.org/r/20231026020329.327329-1-zhiguangni01@gmail.com Signed-off-by: Liam Ni <zhiguangni01@gmail.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Bibo Mao <maobibo@loongson.cn> Cc: Binbin Zhou <zhoubinbin@loongson.cn> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feiyang Chen <chenfeiyang@loongson.cn> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-07x86/coco: Disable 32-bit emulation by default on TDX and SEVKirill A. Shutemov
The INT 0x80 instruction is used for 32-bit x86 Linux syscalls. The kernel expects to receive a software interrupt as a result of the INT 0x80 instruction. However, an external interrupt on the same vector triggers the same handler. The kernel interprets an external interrupt on vector 0x80 as a 32-bit system call that came from userspace. A VMM can inject external interrupts on any arbitrary vector at any time. This remains true even for TDX and SEV guests where the VMM is untrusted. Put together, this allows an untrusted VMM to trigger int80 syscall handling at any given point. The content of the guest register file at that moment defines what syscall is triggered and its arguments. It opens the guest OS to manipulation from the VMM side. Disable 32-bit emulation by default for TDX and SEV. User can override it with the ia32_emulation=y command line option. [ dhansen: reword the changelog ] Reported-by: Supraja Sridhara <supraja.sridhara@inf.ethz.ch> Reported-by: Benedict Schlüter <benedict.schlueter@inf.ethz.ch> Reported-by: Mark Kuhne <mark.kuhne@inf.ethz.ch> Reported-by: Andrin Bertschi <andrin.bertschi@inf.ethz.ch> Reported-by: Shweta Shinde <shweta.shinde@inf.ethz.ch> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@vger.kernel.org> # v6.0+: 1da5c9b x86: Introduce ia32_enabled() Cc: <stable@vger.kernel.org> # v6.0+
2023-11-02Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-10-30Merge tag 'x86-core-2023-10-29-v2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 core updates from Thomas Gleixner: - Limit the hardcoded topology quirk for Hygon CPUs to those which have a model ID less than 4. The newer models have the topology CPUID leaf 0xB correctly implemented and are not affected. - Make SMT control more robust against enumeration failures SMT control was added to allow controlling SMT at boottime or runtime. The primary purpose was to provide a simple mechanism to disable SMT in the light of speculation attack vectors. It turned out that the code is sensible to enumeration failures and worked only by chance for XEN/PV. XEN/PV has no real APIC enumeration which means the primary thread mask is not set up correctly. By chance a XEN/PV boot ends up with smp_num_siblings == 2, which makes the hotplug control stay at its default value "enabled". So the mask is never evaluated. The ongoing rework of the topology evaluation caused XEN/PV to end up with smp_num_siblings == 1, which sets the SMT control to "not supported" and the empty primary thread mask causes the hotplug core to deny the bringup of the APS. Make the decision logic more robust and take 'not supported' and 'not implemented' into account for the decision whether a CPU should be booted or not. - Fake primary thread mask for XEN/PV Pretend that all XEN/PV vCPUs are primary threads, which makes the usage of the primary thread mask valid on XEN/PV. That is consistent with because all of the topology information on XEN/PV is fake or even non-existent. - Encapsulate topology information in cpuinfo_x86 Move the randomly scattered topology data into a separate data structure for readability and as a preparatory step for the topology evaluation overhaul. - Consolidate APIC ID data type to u32 It's fixed width hardware data and not randomly u16, int, unsigned long or whatever developers decided to use. - Cure the abuse of cpuinfo for persisting logical IDs. Per CPU cpuinfo is used to persist the logical package and die IDs. That's really not the right place simply because cpuinfo is subject to be reinitialized when a CPU goes through an offline/online cycle. Use separate per CPU data for the persisting to enable the further topology management rework. It will be removed once the new topology management is in place. - Provide a debug interface for inspecting topology information Useful in general and extremly helpful for validating the topology management rework in terms of correctness or "bug" compatibility. * tag 'x86-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/apic, x86/hyperv: Use u32 in hv_snp_boot_ap() too x86/cpu: Provide debug interface x86/cpu/topology: Cure the abuse of cpuinfo for persisting logical ids x86/apic: Use u32 for wakeup_secondary_cpu[_64]() x86/apic: Use u32 for [gs]et_apic_id() x86/apic: Use u32 for phys_pkg_id() x86/apic: Use u32 for cpu_present_to_apicid() x86/apic: Use u32 for check_apicid_used() x86/apic: Use u32 for APIC IDs in global data x86/apic: Use BAD_APICID consistently x86/cpu: Move cpu_l[l2]c_id into topology info x86/cpu: Move logical package and die IDs into topology info x86/cpu: Remove pointless evaluation of x86_coreid_bits x86/cpu: Move cu_id into topology info x86/cpu: Move cpu_core_id into topology info hwmon: (fam15h_power) Use topology_core_id() scsi: lpfc: Use topology_core_id() x86/cpu: Move cpu_die_id into topology info x86/cpu: Move phys_proc_id into topology info x86/cpu: Encapsulate topology information in cpuinfo_x86 ...
2023-10-30Merge tag 'x86-mm-2023-10-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm handling updates from Ingo Molnar: - Add new NX-stack self-test - Improve NUMA partial-CFMWS handling - Fix #VC handler bugs resulting in SEV-SNP boot failures - Drop the 4MB memory size restriction on minimal NUMA nodes - Reorganize headers a bit, in preparation to header dependency reduction efforts - Misc cleanups & fixes * tag 'x86-mm-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size selftests/x86/lam: Zero out buffer for readlink() x86/sev: Drop unneeded #include x86/sev: Move sev_setup_arch() to mem_encrypt.c x86/tdx: Replace deprecated strncpy() with strtomem_pad() selftests/x86/mm: Add new test that userspace stack is in fact NX x86/sev: Make boot_ghcb_page[] static x86/boot: Move x86_cache_alignment initialization to correct spot x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach x86/sev-es: Allow copy_from_kernel_nofault() in earlier boot x86_64: Show CR4.PSE on auxiliaries like on BSP x86/iommu/docs: Update AMD IOMMU specification document URL x86/sev/docs: Update document URL in amd-memory-encryption.rst x86/mm: Move arch_memory_failure() and arch_is_platform_page() definitions from <asm/processor.h> to <asm/pgtable.h> ACPI/NUMA: Apply SRAT proximity domain to entire CFMWS window x86/numa: Introduce numa_fill_memblks()
2023-10-30Merge tag 'x86_platform_for_6.7_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 platform updates from Borislav Petkov: - Make sure PCI function 4 IDs of AMD family 0x19, models 0x60-0x7f are actually used in the amd_nb.c enumeration - Add support for extracting NUMA information from devicetree for Hyper-V usages - Add PCI device IDs for the new AMD MI300 AI accelerators - Annotate an array in struct uv_rtc_timer_head with the new __counted_by attribute - Rework UV's NMI action parameter handling * tag 'x86_platform_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/amd_nb: Use Family 19h Models 60h-7Fh Function 4 IDs x86/numa: Add Devicetree support x86/of: Move the x86_flattree_get_config() call out of x86_dtb_init() x86/amd_nb: Add AMD Family MI300 PCI IDs x86/platform/uv: Annotate struct uv_rtc_timer_head with __counted_by x86/platform/uv: Rework NMI "action" modparam handling
2023-10-20x86/pti: Fix kernel warnings for pti= and nopti cmdline optionsJo Van Bulck
Parse the pti= and nopti cmdline options using early_param to fix 'Unknown kernel command line parameters "nopti", will be passed to user space' warnings in the kernel log when nopti or pti= are passed to the kernel cmdline on x86 platforms. Additionally allow the kernel to warn for malformed pti= options. Signed-off-by: Jo Van Bulck <jo.vanbulck@cs.kuleuven.be> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Link: https://lore.kernel.org/r/20230819080921.5324-2-jo.vanbulck@cs.kuleuven.be
2023-10-20x86/mm: Drop the 4 MB restriction on minimal NUMA node memory sizeMike Rapoport (IBM)
Qi Zheng reported crashes in a production environment and provided a simplified example as a reproducer: | For example, if we use Qemu to start a two NUMA node kernel, | one of the nodes has 2M memory (less than NODE_MIN_SIZE), | and the other node has 2G, then we will encounter the | following panic: | | BUG: kernel NULL pointer dereference, address: 0000000000000000 | <...> | RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40 | <...> | Call Trace: | <TASK> | deactivate_slab() | bootstrap() | kmem_cache_init() | start_kernel() | secondary_startup_64_no_verify() The crashes happen because of inconsistency between the nodemask that has nodes with less than 4MB as memoryless, and the actual memory fed into the core mm. The commit: 9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring empty node in SRAT parsing") ... that introduced minimal size of a NUMA node does not explain why a node size cannot be less than 4MB and what boot failures this restriction might fix. Fixes have been submitted to the core MM code to tighten up the memory topologies it accepts and to not crash on weird input: mm: page_alloc: skip memoryless nodes entirely mm: memory_hotplug: drop memoryless node from fallback lists Andrew has accepted them into the -mm tree, but there are no stable SHA1's yet. This patch drops the limitation for minimal node size on x86: - which works around the crash without the fixes to the core MM. - makes x86 topologies less weird, - removes an arbitrary and undocumented limitation on NUMA topologies. [ mingo: Improved changelog clarity. ] Reported-by: Qi Zheng <zhengqi.arch@bytedance.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@surriel.com> Link: https://lore.kernel.org/r/ZS+2qqjEO5/867br@gmail.com
2023-10-11x86/sev: Drop unneeded #includeAlexander Shishkin
Commit: 20f07a044a76 ("x86/sev: Move common memory encryption code to mem_encrypt.c") ... forgot to remove the include of virtio_config.h from mem_encrypt_amd.c when it moved the related code to mem_encrypt.c (from where this include subsequently got removed by a later commit). Remove it now. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20231010145220.3960055-3-alexander.shishkin@linux.intel.com
2023-10-11x86/sev: Move sev_setup_arch() to mem_encrypt.cAlexander Shishkin
Since commit: 4d96f9109109b ("x86/sev: Replace occurrences of sev_active() with cc_platform_has()") ... the SWIOTLB bounce buffer size adjustment and restricted virtio memory setting also inadvertently apply to TDX: the code is using cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) as a gatekeeping condition, which is also true for TDX, and this is also what we want. To reflect this, move the corresponding code to generic mem_encrypt.c. No functional changes intended. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20231010145220.3960055-2-alexander.shishkin@linux.intel.com
2023-10-10x86/apic: Use u32 for APIC IDs in global dataThomas Gleixner
APIC IDs are used with random data types u16, u32, int, unsigned int, unsigned long. Make it all consistently use u32 because that reflects the hardware register width and fixup the most obvious usage sites of that. The APIC callbacks will be addressed separately. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Juergen Gross <jgross@suse.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230814085112.922905727@linutronix.de
2023-10-06mm: add statistics for PUD level pagetableBaolin Wang
Recently, we found that cross-die access to pagetable pages on ARM64 machines can cause performance fluctuations in our business. Currently, there are no PMU events available to track this situation on our ARM64 machines, so accurate pagetable accounting can help to analyze this issue, but now the PUD level pagetable accounting is missed. So introduce pagetable_pud_ctor/dtor() to help to get accurate PUD pagetable accounting, as well as converting the architectures which use generic PUD pagetable allocation to add corresponding PUD pagetable accounting. Moreover this patch will mark the PUD level pagetable with PG_table flag, which will help to do sanity validation in unpoison_memory(). On my testing machine, I can see more pagetables statistics after the patch with page-types tool: Before patch: flags page-count MB symbolic-flags long-symbolic-flags 0x0000000004000000 27326 106 __________________________g_________________ pgtable After patch: 0x0000000004000000 27541 107 __________________________g_________________ pgtable Link: https://lkml.kernel.org/r/876c71c03a7e69c17722a690e3225a4f7b172fb2.1695017383.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>