Age | Commit message (Collapse) | Author |
|
xen_read_msr_safe() currently passes an uninitialized argument 'err' to
xen_do_read_msr(). But as xen_do_read_msr() may not set the argument,
xen_read_msr_safe() could return err with an unpredictable value.
To ensure correctness, initialize err to 0 (representing success)
in xen_read_msr_safe().
Do the same in xen_read_msr(), even err is not used after being passed
to xen_do_read_msr().
Closes: https://lore.kernel.org/xen-devel/aBxNI_Q0-MhtBSZG@stanley.mountain/
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250517165713.935384-1-xin@zytor.com
|
|
Both Intel and AMD CPUs support 5-level paging, which is expected to
become more widely adopted in the future. All major x86 Linux
distributions have the feature enabled.
Remove CONFIG_X86_5LEVEL and related #ifdeffery for it to make it more readable.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250516123306.3812286-4-kirill.shutemov@linux.intel.com
|
|
The main CPUID header <asm/cpuid.h> was originally a storefront for the
headers:
<asm/cpuid/api.h>
<asm/cpuid/leaf_0x2_api.h>
Now that the latter CPUID(0x2) header has been merged into the former,
there is no practical difference between <asm/cpuid.h> and
<asm/cpuid/api.h>.
Migrate all users to the <asm/cpuid/api.h> header, in preparation of
the removal of <asm/cpuid.h>.
Don't remove <asm/cpuid.h> just yet, in case some new code in -next
started using it.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250508150240.172915-3-darwi@linutronix.de
|
|
Modify the function type of native_read_msr_safe() to:
int native_read_msr_safe(u32 msr, u64 *val)
This change makes the function return an error code instead of the
MSR value, aligning it with the type of native_write_msr_safe().
Consequently, their callers can check the results in the same way.
While at it, convert leftover MSR data type "unsigned int" to u32.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-16-xin@zytor.com
|
|
An MSR value is represented as a 64-bit unsigned integer, with existing
MSR instructions storing it in EDX:EAX as two 32-bit segments.
The new immediate form MSR instructions, however, utilize a 64-bit
general-purpose register to store the MSR value. To unify the usage of
all MSR instructions, let the default MSR access APIs accept an MSR
value as a single 64-bit argument instead of two 32-bit segments.
The dual 32-bit APIs are still available as convenient wrappers over the
APIs that handle an MSR value as a single 64-bit argument.
The following illustrates the updated derivation of the MSR write APIs:
__wrmsrq(u32 msr, u64 val)
/ \
/ \
native_wrmsrq(msr, val) native_wrmsr(msr, low, high)
|
|
native_write_msr(msr, val)
/ \
/ \
wrmsrq(msr, val) wrmsr(msr, low, high)
When CONFIG_PARAVIRT is enabled, wrmsrq() and wrmsr() are defined on top
of paravirt_write_msr():
paravirt_write_msr(u32 msr, u64 val)
/ \
/ \
wrmsrq(msr, val) wrmsr(msr, low, high)
paravirt_write_msr() invokes cpu.write_msr(msr, val), an indirect layer
of pv_ops MSR write call:
If on native:
cpu.write_msr = native_write_msr
If on Xen:
cpu.write_msr = xen_write_msr
Therefore, refactor pv_cpu_ops.write_msr{_safe}() to accept an MSR value
in a single u64 argument, replacing the current dual u32 arguments.
No functional change intended.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-14-xin@zytor.com
|
|
set_seg() is used to write the following MSRs on Xen:
MSR_FS_BASE
MSR_KERNEL_GS_BASE
MSR_GS_BASE
But none of these MSRs are written using any MSR write safe API.
Therefore there is no need to pass an error pointer argument to
set_seg() for returning an error code to be used in MSR safe APIs.
Remove the error pointer argument.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-13-xin@zytor.com
|
|
As pmu_msr_{read,write}() are now wrappers of pmu_msr_chk_emulated(),
remove them and use pmu_msr_chk_emulated() directly.
As pmu_msr_chk_emulated() could easily return false in the cases where
it would set *emul to false, remove the "emul" argument and use the
return value instead.
While at it, convert the data type of MSR index to u32 in functions
called in pmu_msr_chk_emulated().
Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Suggested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-12-xin@zytor.com
|
|
pmu_msr_{read,write}()
hpa found that pmu_msr_write() is actually a completely pointless
function:
https://lore.kernel.org/lkml/0ec48b84-d158-47c6-b14c-3563fd14bcc4@zytor.com/
all it does is shuffle some arguments, then calls pmu_msr_chk_emulated()
and if it returns true AND the emulated flag is clear then does
*exactly the same thing* that the calling code would have done if
pmu_msr_write() itself had returned true.
And pmu_msr_read() does the equivalent stupidity.
Remove the calls to native_{read,write}_msr{,_safe}() within
pmu_msr_{read,write}(). Instead reuse the existing calling code
that decides whether to call native_{read,write}_msr{,_safe}() based
on the return value from pmu_msr_{read,write}(). Consequently,
eliminate the need to pass an error pointer to pmu_msr_{read,write}().
While at it, refactor pmu_msr_write() to take the MSR value as a u64
argument, replacing the current dual u32 arguments, because the dual
u32 arguments were only used to call native_write_msr{,_safe}(), which
has now been removed.
Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-11-xin@zytor.com
|
|
The pv_ops PMC read API is defined as:
u64 (*read_pmc)(int counter);
But Xen PMC read functions return 'unsigned long long', make them
return u64 consistently.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-7-xin@zytor.com
|
|
For historic reasons there are some TSC-related functions in the
<asm/msr.h> header, even though there's an <asm/tsc.h> header.
To facilitate the relocation of rdtsc{,_ordered}() from <asm/msr.h>
to <asm/tsc.h> and to eventually eliminate the inclusion of
<asm/msr.h> in <asm/tsc.h>, add an explicit <asm/msr.h> dependency
to the source files that reference definitions from <asm/msr.h>.
[ mingo: Clarified the changelog. ]
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250501054241.1245648-1-xin@zytor.com
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix hypercall detection on Xen guests
- Extend the AMD microcode loader SHA check to Zen5, to block loading
of any unreleased standalone Zen5 microcode patches
- Add new Intel CPU model number for Bartlett Lake
- Fix the workaround for AMD erratum 1054
- Fix buggy early memory acceptance between SEV-SNP guests and the EFI
stub
* tag 'x86-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/sev: Avoid shared GHCB page for early memory acceptance
x86/cpu/amd: Fix workaround for erratum 1054
x86/cpu: Add CPU model number for Bartlett Lake CPUs with Raptor Cove cores
x86/microcode/AMD: Extend the SHA check to Zen5, block loading of any unreleased standalone Zen5 microcode patches
x86/xen: Fix __xen_hypercall_setfunc()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fix from Juergen Gross:
"Just a single fix for the Xen multicall driver avoiding a percpu
variable referencing initdata by its initializer"
* tag 'for-linus-6.15a-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: fix multicall debug feature
|
|
Hypercall detection is failing with xen_hypercall_intel() chosen even on
an AMD processor. Looking at the disassembly, the call to
xen_get_vendor() was removed.
The check for boot_cpu_has(X86_FEATURE_CPUID) was used as a proxy for
the x86_vendor having been set.
When CONFIG_X86_REQUIRED_FEATURE_CPUID=y (the default value), DCE eliminates
the call to xen_get_vendor(). An uninitialized value 0 means
X86_VENDOR_INTEL, so the Intel function is always returned.
Remove the if and always call xen_get_vendor() to avoid this issue.
Fixes: 3d37d9396eb3 ("x86/cpufeatures: Add {REQUIRED,DISABLED} feature configs")
Suggested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Xin Li (Intel)" <xin@zytor.com>
Link: https://lore.kernel.org/r/20250410193106.16353-1-jason.andryuk@amd.com
|
|
Initializing a percpu variable with the address of a struct tagged as
.initdata is breaking the build with CONFIG_SECTION_MISMATCH_WARN_ONLY
not set to "y".
Fix that by using an access function instead returning the .initdata
struct address if the percpu space of the struct hasn't been
allocated yet.
Fixes: 368990a7fe30 ("xen: fix multicall debug data referencing")
Reported-by: Borislav Petkov <bp@alien8.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: "Borislav Petkov (AMD)" <bp@alien8.de>
Tested-by: "Borislav Petkov (AMD)" <bp@alien8.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20250327190602.26015-1-jgross@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc objtool fixes from Ingo Molnar:
- Remove the recently introduced ANNOTATE_IGNORE_ALTERNATIVE noise from
clac()/stac() code to make .s files more readable
- Fix INSN_SYSCALL / INSN_SYSRET semantics
- Fix various false-positive warnings
* tag 'objtool-urgent-2025-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix false-positive "ignoring unreachables" warning
objtool: Remove ANNOTATE_IGNORE_ALTERNATIVE from CLAC/STAC
objtool, xen: Fix INSN_SYSCALL / INSN_SYSRET semantics
objtool: Stop UNRET validation on UD2
objtool: Split INSN_CONTEXT_SWITCH into INSN_SYSCALL and INSN_SYSRET
objtool: Fix INSN_CONTEXT_SWITCH handling in validate_unret()
|
|
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When running as a PVH dom0 the ACPI tables exposed to Linux are (mostly)
the native ones, thus exposing the C and P states, that can lead to
attachment of CPU idle and frequency drivers. However the entity in
control of the CPU C and P states is Xen, as dom0 doesn't have a full view
of the system load, neither has all CPUs assigned and identity pinned.
Like it's done for classic PV guests, prevent Linux from using idle or
frequency state drivers when running as a PVH dom0.
On an AMD EPYC 7543P system without this fix a Linux PVH dom0 will keep the
host CPUs spinning at 100% even when dom0 is completely idle, as it's
attempting to use the acpi_idle driver.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20250407101842.67228-1-roger.pau@citrix.com>
|
|
Objtool uses an arbitrary rule for INSN_SYSCALL and INSN_SYSRET that
almost works by accident: if it's in a function, control flow continues
after the instruction, otherwise it terminates.
That behavior should instead be based on the semantics of the underlying
instruction. Change INSN_SYSCALL to always preserve control flow and
INSN_SYSRET to always terminate it.
The changed semantic for INSN_SYSCALL requires a tweak to the
!CONFIG_IA32_EMULATION version of xen_entry_SYSCALL_compat(). In Xen,
SYSCALL is a hypercall which usually returns. But in this case it's a
hypercall to IRET which doesn't return. Add UD2 to tell objtool to
terminate control flow, and to prevent undefined behavior at runtime.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com> # for the Xen part
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/19453dfe9a0431b7f016e9dc16d031cad3812a50.1744095216.git.jpoimboe@kernel.org
|
|
PVH dom0 re-uses logic from PV dom0, in which RAM ranges not assigned to
dom0 are re-used as scratch memory to map foreign and grant pages. Such
logic relies on reporting those unpopulated ranges as RAM to Linux, and
mark them as reserved. This way Linux creates the underlying page
structures required for metadata management.
Such approach works fine on PV because the initial balloon target is
calculated using specific Xen data, that doesn't take into account the
memory type changes described above. However on HVM and PVH the initial
balloon target is calculated using get_num_physpages(), and that function
does take into account the unpopulated RAM regions used as scratch space
for remote domain mappings.
This leads to PVH dom0 having an incorrect initial balloon target, which
causes malfunction (excessive memory freeing) of the balloon driver if the
dom0 memory target is later adjusted from the toolstack.
Fix this by using xen_released_pages to account for any pages that are part
of the memory map, but are already unpopulated when the balloon driver is
initialized. This accounts for any regions used for scratch remote
mappings. Note on x86 xen_released_pages definition is moved to
enlighten.c so it's uniformly available for all Xen-enabled builds.
Take the opportunity to unify PV with PVH/HVM guests regarding the usage of
get_num_physpages(), as that avoids having to add different logic for PV vs
PVH in both balloon_add_regions() and arch_xen_unpopulated_init().
Much like a6aa4eb994ee, the code in this changeset should have been part of
38620fc4e893.
Fixes: a6aa4eb994ee ('xen/x86: add extra pages to unpopulated-alloc if available')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20250407082838.65495-1-roger.pau@citrix.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively monir cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found then using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes thins better - our own HMM selftests now
succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
remove the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.
- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DMAON/DAMOS to filter my the
page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while runnimg our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatoty cleanups to the GENERIC_PTDUMP
Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.
- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallalization of huge page
initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documention
is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permittig the removal of a number of special-case
checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatoty refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into code generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.
- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...
|
|
Move the upcall handler to Xen-specific files.
No functional changes.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250314151220.862768-2-brgerst@gmail.com
|
|
Commit 49147beb0ccb ("x86/xen: allow nesting of same lazy mode") was added
as a solution for a core-mm code change where
arch_[enter|leave]_lazy_mmu_mode() started to be called in a nested
manner; see commit bcc6cc832573 ("mm: add default definition of
set_ptes()").
However, now that we have fixed the API to avoid nesting, we no longer
need this capability in the x86 implementation.
Additionally, from code review, I don't believe the fix was ever robust in
the case of preemption occurring while in the nested lazy mode. The
implementation usually deals with preemption by calling
arch_leave_lazy_mmu_mode() from xen_start_context_switch() for the
outgoing task if we are in the lazy mmu mode. Then in
xen_end_context_switch(), it restarts the lazy mode by calling
arch_enter_lazy_mmu_mode() for an incoming task that was in the lazy mode
when it was switched out. But arch_leave_lazy_mmu_mode() will only unwind
a single level of nesting. If we are in the double nest, then it's not
fully unwound and per-cpu variables are left in a bad state.
So the correct solution is to remove the possibility of nesting from the
higher level (which has now been done) and remove this x86-specific
solution.
Link: https://lkml.kernel.org/r/20250303141542.3371656-6-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juegren Gross <jgross@suse.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
smp_store_cpu_info() is just a wrapper around identify_secondary_cpu()
without further value.
Move the extra bits from smp_store_cpu_info() into identify_secondary_cpu()
and remove the wrapper.
[ darwi: Make it compile and fix up the xen/smp_pv.c instance ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-9-darwi@linutronix.de
|
|
Commit f388f60ca904 ("x86/cpu: Drop configuration options for early 64-bit CPUs")
removes the config symbols MCORE2 and MK8.
With that, the references to those two config symbols in xen's x86 Kconfig
are obsolete. Drop them.
Fixes: f388f60ca904 ("x86/cpu: Drop configuration options for early 64-bit CPUs")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20250303093759.371445-1-lukas.bulwahn@redhat.com
|
|
Every pv_ops.mmu.tlb_remove_table call ends up calling tlb_remove_table.
Get rid of the indirection by simply calling tlb_remove_table directly,
and not going through the paravirt function pointers.
Suggested-by: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20250213161423.449435-3-riel@surriel.com
|
|
Pick up upstream x86 fixes before applying new patches.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The percpu section is currently linked at absolute address 0, because
older compilers hard-coded the stack protector canary value at a fixed
offset from the start of the GS segment. Now that the canary is a
normal percpu variable, the percpu section does not need to be linked
at a specific address.
x86-64 will now calculate the percpu offsets as the delta between the
initial percpu address and the dynamically allocated memory, like other
architectures. Note that GSBASE is limited to the canonical address
width (48 or 57 bits, sign-extended). As long as the kernel text,
modules, and the dynamically allocated percpu memory are all in the
negative address space, the delta will not overflow this limit.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250123190747.745588-9-brgerst@gmail.com
|
|
Older versions of GCC fixed the location of the stack protector canary
at %gs:40. This constraint forced the percpu section to be linked at
absolute address 0 so that the canary could be the first data object in
the percpu section. Supporting the zero-based percpu section requires
additional code to handle relocations for RIP-relative references to
percpu data, extra complexity to kallsyms, and workarounds for linker
bugs due to the use of absolute symbols.
GCC 8.1 supports redefining where the canary is located, allowing it to
become a normal percpu variable instead of at a fixed location. This
removes the constraint that the percpu section must be zero-based.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250123190747.745588-8-brgerst@gmail.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Three fixes to xen-swiotlb driver:
- two fixes for issues coming up due to another fix in 6.12
- addition of an __init annotation"
* tag 'for-linus-6.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
Xen/swiotlb: mark xen_swiotlb_fixup() __init
x86/xen: allow larger contiguous memory regions in PV guests
xen/swiotlb: relax alignment requirements
|
|
With the introduction of kCFI the addition of ENDBR to
SYM_FUNC_START* no longer suffices to make the function indirectly
callable. This now requires the use of SYM_TYPED_FUNC_START.
As such, remove the implicit ENDBR from SYM_FUNC_START* and add some
explicit annotations to fix things up again.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20250207122546.409116003@infradead.org
|
|
Today a PV guest (including dom0) can create 2MB contiguous memory
regions for DMA buffers at max. This has led to problems at least
with the megaraid_sas driver, which wants to allocate a 2.3MB DMA
buffer.
The limiting factor is the frame array used to do the hypercall for
making the memory contiguous, which has 512 entries and is just a
static array in mmu_pv.c.
In order to not waste memory for non-PV guests, put the initial
frame array into .init.data section and dynamically allocate an array
from the .init_after_bootmem hook of PV guests.
In case a contiguous memory area larger than the initially supported
2MB is requested, allocate a larger buffer for the frame list. Note
that such an allocation is tried only after memory management has been
initialized properly, which is tested via a flag being set in the
.init_after_bootmem hook.
Fixes: 9f40ec84a797 ("xen/swiotlb: add alignment check for dma buffers")
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Alan Robinson <Alan.Robinson@fujitsu.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Three fixes for xen_hypercall_hvm() that was introduced in the 6.13
cycle"
* tag 'for-linus-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: remove unneeded dummy push from xen_hypercall_hvm()
x86/xen: add FRAME_END to xen_hypercall_hvm()
x86/xen: fix xen_hypercall_hvm() to not clobber %rbx
|
|
Stack alignment of the kernel in 64-bit mode is 8, not 16, so the
dummy push in xen_hypercall_hvm() for aligning the stack to 16 bytes
can be removed.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
xen_hypercall_hvm() is missing a FRAME_END at the end, add it.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202502030848.HTNTTuo9-lkp@intel.com/
Fixes: b4845bb63838 ("x86/xen: add central hypercall functions")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
xen_hypercall_hvm(), which is used when running as a Xen PVH guest at
most only once during early boot, is clobbering %rbx. Depending on
whether the caller relies on %rbx to be preserved across the call or
not, this clobbering might result in an early crash of the system.
This can be avoided by using an already saved register instead of %rbx.
Fixes: b4845bb63838 ("x86/xen: add central hypercall functions")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
"Three minor fixes"
* tag 'for-linus-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: update pvcalls_front_accept prototype
Grab mm lock before grabbing pt lock
xen: pcpu: remove unnecessary __ref annotation
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes
the page allocator so we end up with the ability to allocate and
free zero-refcount pages. So that callers (ie, slab) can avoid a
refcount inc & dec
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
use large folios other than PMD-sized ones
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
and fixes for this small built-in kernel selftest
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part
of the mapletree code
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups
- "simplify split calculation" from Wei Yang provides a few fixes and
a test for the mapletree code
- "mm/vma: make more mmap logic userland testable" from Lorenzo
Stoakes continues the work of moving vma-related code into the
(relatively) new mm/vma.c
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the
page allocator
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue.
It should reduce the amount of unnecessary reading
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated:
https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
Qi's series addresses this windup by synchronously freeing PTE
memory within the context of madvise(MADV_DONTNEED)
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests
code when optional compiler warnings are enabled
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from
David Hildenbrand tightens the allocator's observance of
__GFP_HARDWALL
- "pkeys kselftests improvements" from Kevin Brodsky implements
various fixes and cleanups in the MM selftests code, mainly
pertaining to the pkeys tests
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a
tmpfs-based kernel build was demonstrated
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleaups for zram in the area of
zram_write_page(). A watchdog softlockup warning was eliminated
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
Brodsky cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging
logic
- "Account page tables at all levels" from Kevin Brodsky cleans up
and regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy
- "mm/damon: replace most damon_callback usages in sysfs with new
core functions" from SeongJae Park cleans up and generalizes
DAMON's sysfs file interface logic
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is
presented in response to DAMOS actions
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park
removes DAMON's long-deprecated debugfs interfaces. Thus the
migration to sysfs is completed
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
Peter Xu cleans up and generalizes the hugetlb reservation
accounting
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting),
but also inclusion (allowing) behavior
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to
reduce the size of struct page and to enable dynamic allocation of
memory descriptors
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes
and simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload. As was a 35% reduction for kernel
build time with swap-on-zram
- "mm: update mips to use do_mmap(), make mmap_region() internal"
from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few
MGLRU regressions and otherwise improves MGLRU performance
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
Park updates DAMON documentation
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing
- "mm: hugetlb+THP folio and migration cleanups" from David
Hildenbrand provides various cleanups in the areas of hugetlb
folios, THP folios and migration
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for
pagecache reading and writing. To permite userspace to address
issues with massive buildup of useless pagecache when
reading/writing fast devices
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests"
* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm/compaction: fix UBSAN shift-out-of-bounds warning
s390/mm: add missing ctor/dtor on page table upgrade
kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
tools: add VM_WARN_ON_VMG definition
mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
seqlock: add missing parameter documentation for raw_seqcount_try_begin()
mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
mm/page_alloc: remove the incorrect and misleading comment
zram: remove zcomp_stream_put() from write_incompressible_page()
mm: separate move/undo parts from migrate_pages_batch()
mm/kfence: use str_write_read() helper in get_access_type()
selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
selftests/mm: vm_util: split up /proc/self/smaps parsing
selftests/mm: virtual_address_range: unmap chunks after validation
selftests/mm: virtual_address_range: mmap() without PROT_WRITE
selftests/memfd/memfd_test: fix possible NULL pointer dereference
mm: add FGP_DONTCACHE folio creation flag
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
...
|
|
Before SLUB initialization, various subsystems used memblock_alloc to
allocate memory. In most cases, when memory allocation fails, an
immediate panic is required. To simplify this behavior and reduce
repetitive checks, introduce `memblock_alloc_or_panic`. This function
ensures that memory allocation failures result in a panic automatically,
improving code readability and consistency across subsystems that require
this behavior.
[guoweikang.kernel@gmail.com: arch/s390: save_area_alloc default failure behavior changed to panic]
Link: https://lkml.kernel.org/r/20250109033136.2845676-1-guoweikang.kernel@gmail.com
Link: https://lore.kernel.org/lkml/Z2fknmnNtiZbCc7x@kernel.org/
Link: https://lkml.kernel.org/r/20250102072528.650926-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390]
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Borislav Petkov:
- The first part of a restructuring of AMD's representation of a
northbridge which is legacy now, and the creation of the new AMD node
concept which represents the Zen architecture of having a collection
of I/O devices within an SoC. Those nodes comprise the so-called data
fabric on Zen.
This has at least one practical advantage of not having to add a PCI
ID each time a new data fabric PCI device releases. Eventually, the
lot more uniform provider of data fabric functionality amd_node.c
will be used by all the drivers which need it
- Smaller cleanups
* tag 'x86_misc_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/amd_node: Use defines for SMN register offsets
x86/amd_node: Remove dependency on AMD_NB
x86/amd_node: Update __amd_smn_rw() error paths
x86/amd_nb: Move SMN access code to a new amd_node driver
x86/amd_nb, hwmon: (k10temp): Simplify amd_pci_dev_to_node_id()
x86/amd_nb: Simplify function 3 search
x86/amd_nb: Use topology info to get AMD node count
x86/amd_nb: Simplify root device search
x86/amd_nb: Simplify function 4 search
x86: Start moving AMD node functionality out of AMD_NB
x86/amd_nb: Clean up early_is_amd_nb()
x86/amd_nb: Restrict init function to AMD-based systems
x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid updates from Borislav Petkov:
- Remove the less generic CPU matching infra around struct x86_cpu_desc
and use the generic struct x86_cpu_id thing
- Remove magic naked numbers for CPUID functions and use proper defines
of the prefix CPUID_LEAF_*. Consolidate some of the crazy use around
the tree
- Smaller cleanups and improvements
* tag 'x86_cpu_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Make all all CPUID leaf names consistent
x86/fpu: Remove unnecessary CPUID level check
x86/fpu: Move CPUID leaf definitions to common code
x86/tsc: Remove CPUID "frequency" leaf magic numbers.
x86/tsc: Move away from TSC leaf magic numbers
x86/cpu: Move TSC CPUID leaf definition
x86/cpu: Refresh DCA leaf reading code
x86/cpu: Remove unnecessary MwAIT leaf checks
x86/cpu: Use MWAIT leaf definition
x86/cpu: Move MWAIT leaf definition to common header
x86/cpu: Remove 'x86_cpu_desc' infrastructure
x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id'
x86/cpu: Replace PEBS use of 'x86_cpu_desc' use with 'x86_cpu_id'
x86/cpu: Expose only stepping min/max interface
x86/cpu: Introduce new microcode matching helper
x86/cpufeature: Document cpu_feature_enabled() as the default to use
x86/paravirt: Remove the WBINVD callback
x86/cpufeatures: Free up unused feature bits
|
|
Function xen_pin_page calls xen_pte_lock, which in turn grab page
table lock (ptlock). When locking, xen_pte_lock expect mm->page_table_lock
to be held before grabbing ptlock, but this does not happen when pinning
is caused by xen_mm_pin_all.
This commit addresses lockdep warning below, which shows up when
suspending a Xen VM.
[ 3680.658422] Freezing user space processes
[ 3680.660156] Freezing user space processes completed (elapsed 0.001 seconds)
[ 3680.660182] OOM killer disabled.
[ 3680.660192] Freezing remaining freezable tasks
[ 3680.661485] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[ 3680.685254]
[ 3680.685265] ==================================
[ 3680.685269] WARNING: Nested lock was not taken
[ 3680.685274] 6.12.0+ #16 Tainted: G W
[ 3680.685279] ----------------------------------
[ 3680.685283] migration/0/19 is trying to lock:
[ 3680.685288] ffff88800bac33c0 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: xen_pin_page+0x175/0x1d0
[ 3680.685303]
[ 3680.685303] but this task is not holding:
[ 3680.685308] init_mm.page_table_lock
[ 3680.685311]
[ 3680.685311] stack backtrace:
[ 3680.685316] CPU: 0 UID: 0 PID: 19 Comm: migration/0 Tainted: G W 6.12.0+ #16
[ 3680.685324] Tainted: [W]=WARN
[ 3680.685328] Stopper: multi_cpu_stop+0x0/0x120 <- __stop_cpus.constprop.0+0x8c/0xd0
[ 3680.685339] Call Trace:
[ 3680.685344] <TASK>
[ 3680.685347] dump_stack_lvl+0x77/0xb0
[ 3680.685356] __lock_acquire+0x917/0x2310
[ 3680.685364] lock_acquire+0xce/0x2c0
[ 3680.685369] ? xen_pin_page+0x175/0x1d0
[ 3680.685373] _raw_spin_lock_nest_lock+0x2f/0x70
[ 3680.685381] ? xen_pin_page+0x175/0x1d0
[ 3680.685386] xen_pin_page+0x175/0x1d0
[ 3680.685390] ? __pfx_xen_pin_page+0x10/0x10
[ 3680.685394] __xen_pgd_walk+0x233/0x2c0
[ 3680.685401] ? stop_one_cpu+0x91/0x100
[ 3680.685405] __xen_pgd_pin+0x5d/0x250
[ 3680.685410] xen_mm_pin_all+0x70/0xa0
[ 3680.685415] xen_pv_pre_suspend+0xf/0x280
[ 3680.685420] xen_suspend+0x57/0x1a0
[ 3680.685428] multi_cpu_stop+0x6b/0x120
[ 3680.685432] ? update_cpumasks_hier+0x7c/0xa60
[ 3680.685439] ? __pfx_multi_cpu_stop+0x10/0x10
[ 3680.685443] cpu_stopper_thread+0x8c/0x140
[ 3680.685448] ? smpboot_thread_fn+0x20/0x1f0
[ 3680.685454] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 3680.685458] smpboot_thread_fn+0xed/0x1f0
[ 3680.685462] kthread+0xde/0x110
[ 3680.685467] ? __pfx_kthread+0x10/0x10
[ 3680.685471] ret_from_fork+0x2f/0x50
[ 3680.685478] ? __pfx_kthread+0x10/0x10
[ 3680.685482] ret_from_fork_asm+0x1a/0x30
[ 3680.685489] </TASK>
[ 3680.685491]
[ 3680.685491] other info that might help us debug this:
[ 3680.685497] 1 lock held by migration/0/19:
[ 3680.685500] #0: ffffffff8284df38 (pgd_lock){+.+.}-{3:3}, at: xen_mm_pin_all+0x14/0xa0
[ 3680.685512]
[ 3680.685512] stack backtrace:
[ 3680.685518] CPU: 0 UID: 0 PID: 19 Comm: migration/0 Tainted: G W 6.12.0+ #16
[ 3680.685528] Tainted: [W]=WARN
[ 3680.685531] Stopper: multi_cpu_stop+0x0/0x120 <- __stop_cpus.constprop.0+0x8c/0xd0
[ 3680.685538] Call Trace:
[ 3680.685541] <TASK>
[ 3680.685544] dump_stack_lvl+0x77/0xb0
[ 3680.685549] __lock_acquire+0x93c/0x2310
[ 3680.685554] lock_acquire+0xce/0x2c0
[ 3680.685558] ? xen_pin_page+0x175/0x1d0
[ 3680.685562] _raw_spin_lock_nest_lock+0x2f/0x70
[ 3680.685568] ? xen_pin_page+0x175/0x1d0
[ 3680.685572] xen_pin_page+0x175/0x1d0
[ 3680.685578] ? __pfx_xen_pin_page+0x10/0x10
[ 3680.685582] __xen_pgd_walk+0x233/0x2c0
[ 3680.685588] ? stop_one_cpu+0x91/0x100
[ 3680.685592] __xen_pgd_pin+0x5d/0x250
[ 3680.685596] xen_mm_pin_all+0x70/0xa0
[ 3680.685600] xen_pv_pre_suspend+0xf/0x280
[ 3680.685607] xen_suspend+0x57/0x1a0
[ 3680.685611] multi_cpu_stop+0x6b/0x120
[ 3680.685615] ? update_cpumasks_hier+0x7c/0xa60
[ 3680.685620] ? __pfx_multi_cpu_stop+0x10/0x10
[ 3680.685625] cpu_stopper_thread+0x8c/0x140
[ 3680.685629] ? smpboot_thread_fn+0x20/0x1f0
[ 3680.685634] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 3680.685638] smpboot_thread_fn+0xed/0x1f0
[ 3680.685642] kthread+0xde/0x110
[ 3680.685645] ? __pfx_kthread+0x10/0x10
[ 3680.685649] ret_from_fork+0x2f/0x50
[ 3680.685654] ? __pfx_kthread+0x10/0x10
[ 3680.685657] ret_from_fork_asm+0x1a/0x30
[ 3680.685662] </TASK>
[ 3680.685267] xen:grant_table: Grant tables using version 1 layout
[ 3680.685921] OOM killer enabled.
[ 3680.685934] Restarting tasks ... done.
Signed-off-by: Maksym Planeta <maksym@exostellar.io>
Reviewed-by: Juergen Gross <jgross@suse.com>
Message-ID: <20241204103516.3309112-1-maksym@exostellar.io>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
The leaf names are not consistent. Give them all a CPUID_LEAF_ prefix
for consistency and vertical alignment.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dave Jiang <dave.jiang@intel.com> # for ioatdma bits
Link: https://lore.kernel.org/all/20241213205040.7B0C3241%40davehans-spike.ostc.intel.com
|
|
Begin constructing a common place to keep all CPUID leaf definitions.
Move CPUID_MWAIT_LEAF to the CPUID header and include it where
needed.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205028.EE94D02A%40davehans-spike.ostc.intel.com
|
|
The hypercall page is no longer needed. It can be removed, as from the
Xen perspective it is optional.
But, from Linux's perspective, it removes naked RET instructions that
escape the speculative protections that Call Depth Tracking and/or
Untrain Ret are trying to achieve.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
|
|
Add generic hypercall functions usable for all normal (i.e. not iret)
hypercalls. Depending on the guest type and the processor vendor
different functions need to be used due to the to be used instruction
for entering the hypervisor:
- PV guests need to use syscall
- HVM/PVH guests on Intel need to use vmcall
- HVM/PVH guests on AMD and Hygon need to use vmmcall
As PVH guests need to issue hypercalls very early during boot, there
is a 4th hypercall function needed for HVM/PVH which can be used on
Intel and AMD processors. It will check the vendor type and then set
the Intel or AMD specific function to use via static_call().
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
|
|
Instead of jumping to the Xen hypercall page for doing the iret
hypercall, directly code the required sequence in xen-asm.S.
This is done in preparation of no longer using hypercall page at all,
as it has shown to cause problems with speculation mitigations.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
|