Age | Commit message (Collapse) | Author |
|
Although IBS "swfilt" can prevent leaking samples with kernel RIP to the
userspace, there are few subtle cases where a 'data' address and/or a
'branch target' address can fall under kernel address range although RIP
is from userspace. Prevent leaking kernel 'data' addresses by discarding
such samples when {exclude_kernel=1,swfilt=1}.
IBS can now be invoked by unprivileged user with the introduction of
"swfilt". However, this creates a loophole in the interface where an
unprivileged user can get physical address of the userspace virtual
addresses through IBS register raw dump (PERF_SAMPLE_RAW). Prevent this
as well.
This upstream commit fixed the most obvious leak:
65a99264f5e5 perf/x86: Check data address for IBS software filter
Follow that up with a more complete fix.
Fixes: d29e744c7167 ("perf/x86: Relax privilege filter restriction on AMD IBS")
Suggested-by: Matteo Rizzo <matteorizzo@google.com>
Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250321161251.1033-1-ravi.bangoria@amd.com
|
|
I wonder how many people were checking their glibc version when
considering whether to enable this option.
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heidelberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-7-b0cbaa6fa338@ixit.cz
|
|
I was unable to find a good description of the ServerWorks CNB20LE
chipset. However, it was probably exclusively used with the Pentium III
processor (this CPU model was used in all references to it that I
found where the CPU model was provided: dmesgs in [1] and [2];
[3] page 2; [4]-[7]).
As is widely known, the Pentium III processor did not support the 64-bit
mode, support for which was introduced by Intel a couple of years later.
So it is safe to assume that no systems with the CNB20LE chipset have
amd64 and the CONFIG_PCI_CNB20LE_QUIRK may now depend on X86_32.
Additionally, I have determined that most computers with the CNB20LE
chipset did have ACPI support and this driver was inactive on them.
I have submitted a patch to remove this driver, but it was met with
resistance [8].
[1] Jim Studt, Re: Problem with ServerWorks CNB20LE and lost interrupts
Linux Kernel Mailing List, https://lkml.org/lkml/2002/1/11/111
[2] RedHat Bug 665109 - e100 problems on old Compaq Proliant DL320
https://bugzilla.redhat.com/show_bug.cgi?id=665109
[3] R. Hughes-Jones, S. Dallison, G. Fairey, Performance Measurements on
Gigabit Ethernet NICs and Server Quality Motherboards,
http://datatag.web.cern.ch/papers/pfldnet2003-rhj.doc
[4] "Hardware for Linux",
Probe #d6b5151873 of Intel STL2-bd A28808-302 Desktop Computer (STL2)
https://linux-hardware.org/?probe=d6b5151873
[5] "Hardware for Linux", Probe #0b5d843f10 of Compaq ProLiant DL380
https://linux-hardware.org/?probe=0b5d843f10
[6] Ubuntu Forums, Dell Poweredge 2400 - Adaptec SCSI Bus AIC-7880
https://ubuntuforums.org/showthread.php?t=1689552
[7] Ira W. Snyder, "BISECTED: 2.6.35 (and -git) fail to boot: APIC problems"
https://lkml.org/lkml/2010/8/13/220
[8] Bjorn Helgaas, "Re: [PATCH] x86/pci: drop ServerWorks / Broadcom
CNB20LE PCI host bridge driver"
https://lore.kernel.org/lkml/20220318165535.GA840063@bhelgaas/T/
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heideberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-6-b0cbaa6fa338@ixit.cz
|
|
This configuration option had no help text, so add it.
CONFIG_EXPERT is enabled on some distribution kernels, so people using a
distribution kernel's configuration as a starting point will see this
option.
[ mingo: Standardized the new Kconfig text a bit. ]
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heideberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-5-b0cbaa6fa338@ixit.cz
|
|
The order of the entries matches the order they appear in Kconfig.
In 2011, AMD Elan was moved to Kconfig.cpu and the dependency on
X86_EXTENDED_PLATFORM was dropped in:
ce9c99af8d4b ("x86, cpu: Move AMD Elan Kconfig under "Processor family"")
Support for Moorestown MID devices was removed in 2012 in:
1a8359e411eb ("x86/mid: Remove Intel Moorestown")
SGI 320/540 (Visual Workstation) was removed in 2014 in:
c5f9ee3d665a ("x86, platforms: Remove SGI Visual Workstation")
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heideberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-4-b0cbaa6fa338@ixit.cz
|
|
So that these options will be displayed together in menuconfig etc.
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heidelberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-3-b0cbaa6fa338@ixit.cz
|
|
It appears that (X86_64 || X86_32) is always true on x86.
This logical OR directive was introduced in:
6ea3038648da ("arch/x86: remove depends on CONFIG_EXPERIMENTAL")
By (EXPERIMENTAL && X86_32) turning into (X86_32). Since
this change was an identity transformation, nobody noticed
that the condition turned into 'true'.
[ mingo: Updated changelog ]
Fixes: 6ea3038648da ("arch/x86: remove depends on CONFIG_EXPERIMENTAL")
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heideberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-2-b0cbaa6fa338@ixit.cz
|
|
As many current platforms (most modern Intel CPUs and QEMU) have x2APIC
present, enable CONFIG_X86_X2APIC by default as it gives performance
and functionality benefits. Additionally, if the BIOS has already
switched APIC to x2APIC mode, but CONFIG_X86_X2APIC is disabled, the
kernel will panic in arch/x86/kernel/apic/apic.c .
Also improve the help text, which was confusing and really did not
describe what the feature is about.
Help text references and discussion:
Both Intel [1] and AMD [3] spell the name as "x2APIC", not "x2apic".
"It allows faster access to the local APIC"
[2], chapter 2.1, page 15:
"More efficient MSR interface to access APIC registers."
"x2APIC was introduced in Intel CPUs around 2008":
I was unable to find specific information which Intel CPUs
support x2APIC. Wikipedia claims it was "introduced with the
Nehalem microarchitecture in November 2008", but I was not able
to confirm this independently. At least some Nehalem CPUs do not
support x2APIC [1].
The documentation [2] is dated June 2008. Linux kernel also
introduced x2APIC support in 2008, so the year seems to be
right.
"and in AMD EPYC CPUs in 2019":
[3], page 15:
"AMD introduced an x2APIC in our EPYC 7002 Series processors for
the first time."
"It is also frequently emulated in virtual machines, even when the host
CPU does not support it."
[1]
"If this configuration option is disabled, the kernel will not boot on
some platforms that have x2APIC enabled."
According to some BIOS documentation [4], the x2APIC may be
"disabled", "enabled", or "force enabled" on this system.
I think that "enabled" means "made available to the operating
system, but not already turned on" and "force enabled" means
"already switched to x2APIC mode when the OS boots". Only in the
latter mode a kernel without CONFIG_X86_X2APIC will panic in
validate_x2apic() in arch/x86/kernel/apic/apic.c .
QEMU 4.2.1 and my Intel HP laptop (bought in 2019) use the
"enabled" mode and the kernel does not panic.
[1] "Re: [Qemu-devel] [Question] why x2apic's set by default without host sup"
https://lists.gnu.org/archive/html/qemu-devel/2013-07/msg03527.html
[2] Intel® 64 Architecture x2APIC Specification,
( https://www.naic.edu/~phil/software/intel/318148.pdf )
[3] Workload Tuning Guide for AMD EPYC ™ 7002 Series Processor Based
Servers Application Note,
https://developer.amd.com/wp-content/resources/56745_0.80.pdf
[4] UEFI System Utilities and Shell Command Mobile Help for HPE ProLiant
Gen10, ProLiant Gen10 Plus Servers and HPE Synergy:
Enabling or disabling Processor x2APIC Support
https://techlibrary.hpe.com/docs/iss/proliant-gen10-uefi/s_enable_disable_x2APIC_support.html
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-1-b0cbaa6fa338@ixit.cz
|
|
Alias all the addresses used in the past and currently to the single
contact address.
Link: https://lkml.kernel.org/r/20250319085251.3335678-1-alexander.sverdlin@siemens.com
Signed-off-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
proc_pid_wchan() used to report kernel addresses to user space but that is
no longer the case today. Bring the comment above proc_pid_wchan() in
sync with the implementation.
Link: https://lkml.kernel.org/r/20250319210222.1518771-1-bvanassche@acm.org
Fixes: b2f73922d119 ("fs/proc, core/debug: Don't expose absolute kernel addresses via wchan")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Improve readability and maintainability by replacing a hard coded string
allocation and formatting by using the kasprintf() helper.
It also eliminates the GCC compiler warning (with CONFIG_WERROR=y, which
is default, it becomes an error:
kernel/relay.c:357:42: error: `snprintf' output may be truncated before the last format character [-Werror=format-truncation=]
Link: https://lkml.kernel.org/r/20250317212948.1811176-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Replace open coded variant of DEFINE_RES(). No functional changes intended.
Link: https://lkml.kernel.org/r/20250317181412.1560630-5-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Replace open coded variants of DEFINE_RES_*_NAMED().
Link: https://lkml.kernel.org/r/20250317181412.1560630-4-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Replace open coded variant of DEFINE_RES_NAMED_DESC().
Link: https://lkml.kernel.org/r/20250317181412.1560630-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "resource: Split and use DEFINE_RES*() macros", v2.
Replace open coded variants of DEFINE_RES*() macros.
Note, there are many more possibilities over the kernel and even in
reources.c, however the latter contains not so trivial leftovers. That's
why the examples cover only straightforward conversions.
This patch (of 4):
In some cases it would be useful to supply predefined descriptor of the
resource. For this, introduce DEFINE_RES_NAMED_DESC() macro.
While at it, provide DEFINE_RES() that takes only start, size,
and flags.
Link: https://lkml.kernel.org/r/20250317181412.1560630-1-andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20250317181412.1560630-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add a hung_task detector mutex blocking test sample code.
This module will create a dummy file on the debugfs. That file will cause
the read process to sleep for enough long time (256 seconds) while holding
a mutex. As a result, the second process will wait on the mutex for a
prolonged duration and be detected by the hung_task detector.
Usage is;
> cd /sys/kernel/debug/hung_task
> cat mutex & cat mutex
and wait for hung_task message.
[akpm@linux-foundation.org: make `hung_task_dir' static]
Closes: https://lore.kernel.org/oe-kbuild-all/202503180827.4StpuFrD-lkp@intel.com/
Link: https://lkml.kernel.org/r/174046696281.2194069.4567490148001547311.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tomasz Figa <tfiga@chromium.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yongliang Gao <leonylgao@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "hung_task: Dump the blocking task stacktrace", v4.
The hung_task detector is very useful for detecting the lockup. However,
since it only dumps the blocked (uninterruptible sleep) processes, it is
not enough to identify the root cause of that lockup.
For example, if a process holds a mutex and sleep an event in
interruptible state long time, the other processes will wait on the mutex
in uninterruptible state. In this case, the waiter processes are dumped,
but the blocker process is not shown because it is sleep in interruptible
state.
This adds a feature to dump the blocker task which holds a mutex
when detecting a hung task. e.g.
INFO: task cat:115 blocked for more than 122 seconds.
Not tainted 6.14.0-rc3-00003-ga8946be3de00 #156
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:cat state:D stack:13432 pid:115 tgid:115 ppid:106 task_flags:0x400100 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x731/0x960
? schedule_preempt_disabled+0x54/0xa0
schedule+0xb7/0x140
? __mutex_lock+0x51b/0xa60
? __mutex_lock+0x51b/0xa60
schedule_preempt_disabled+0x54/0xa0
__mutex_lock+0x51b/0xa60
read_dummy+0x23/0x70
full_proxy_read+0x6a/0xc0
vfs_read+0xc2/0x340
? __pfx_direct_file_splice_eof+0x10/0x10
? do_sendfile+0x1bd/0x2e0
ksys_read+0x76/0xe0
do_syscall_64+0xe3/0x1c0
? exc_page_fault+0xa9/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x4840cd
RSP: 002b:00007ffe99071828 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004840cd
RDX: 0000000000001000 RSI: 00007ffe99071870 RDI: 0000000000000003
RBP: 00007ffe99071870 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000
R13: 00000000132fd3a0 R14: 0000000000000001 R15: ffffffffffffffff
</TASK>
INFO: task cat:115 is blocked on a mutex likely owned by task cat:114.
task:cat state:S stack:13432 pid:114 tgid:114 ppid:106 task_flags:0x400100 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x731/0x960
? schedule_timeout+0xa8/0x120
schedule+0xb7/0x140
schedule_timeout+0xa8/0x120
? __pfx_process_timeout+0x10/0x10
msleep_interruptible+0x3e/0x60
read_dummy+0x2d/0x70
full_proxy_read+0x6a/0xc0
vfs_read+0xc2/0x340
? __pfx_direct_file_splice_eof+0x10/0x10
? do_sendfile+0x1bd/0x2e0
ksys_read+0x76/0xe0
do_syscall_64+0xe3/0x1c0
? exc_page_fault+0xa9/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x4840cd
RSP: 002b:00007ffe3e0147b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004840cd
RDX: 0000000000001000 RSI: 00007ffe3e014800 RDI: 0000000000000003
RBP: 00007ffe3e014800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000
R13: 000000001a0a93a0 R14: 0000000000000001 R15: ffffffffffffffff
</TASK>
TBD: We can extend this feature to cover other locks like rwsem and
rt_mutex, but rwsem requires to dump all the tasks which acquire and wait
that rwsem. We can follow the waiter link but the output will be a bit
different compared with mutex case.
This patch (of 2):
The "hung_task" shows a long-time uninterruptible slept task, but most
often, it's blocked on a mutex acquired by another task. Without dumping
such a task, investigating the root cause of the hung task problem is very
difficult.
This introduce task_struct::blocker_mutex to point the mutex lock which
this task is waiting for. Since the mutex has "owner" information, we can
find the owner task and dump it with hung tasks.
Note: the owner can be changed while dumping the owner task, so
this is "likely" the owner of the mutex.
With this change, the hung task shows blocker task's info like below;
INFO: task cat:115 blocked for more than 122 seconds.
Not tainted 6.14.0-rc3-00003-ga8946be3de00 #156
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:cat state:D stack:13432 pid:115 tgid:115 ppid:106 task_flags:0x400100 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x731/0x960
? schedule_preempt_disabled+0x54/0xa0
schedule+0xb7/0x140
? __mutex_lock+0x51b/0xa60
? __mutex_lock+0x51b/0xa60
schedule_preempt_disabled+0x54/0xa0
__mutex_lock+0x51b/0xa60
read_dummy+0x23/0x70
full_proxy_read+0x6a/0xc0
vfs_read+0xc2/0x340
? __pfx_direct_file_splice_eof+0x10/0x10
? do_sendfile+0x1bd/0x2e0
ksys_read+0x76/0xe0
do_syscall_64+0xe3/0x1c0
? exc_page_fault+0xa9/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x4840cd
RSP: 002b:00007ffe99071828 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004840cd
RDX: 0000000000001000 RSI: 00007ffe99071870 RDI: 0000000000000003
RBP: 00007ffe99071870 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000
R13: 00000000132fd3a0 R14: 0000000000000001 R15: ffffffffffffffff
</TASK>
INFO: task cat:115 is blocked on a mutex likely owned by task cat:114.
task:cat state:S stack:13432 pid:114 tgid:114 ppid:106 task_flags:0x400100 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x731/0x960
? schedule_timeout+0xa8/0x120
schedule+0xb7/0x140
schedule_timeout+0xa8/0x120
? __pfx_process_timeout+0x10/0x10
msleep_interruptible+0x3e/0x60
read_dummy+0x2d/0x70
full_proxy_read+0x6a/0xc0
vfs_read+0xc2/0x340
? __pfx_direct_file_splice_eof+0x10/0x10
? do_sendfile+0x1bd/0x2e0
ksys_read+0x76/0xe0
do_syscall_64+0xe3/0x1c0
? exc_page_fault+0xa9/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x4840cd
RSP: 002b:00007ffe3e0147b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004840cd
RDX: 0000000000001000 RSI: 00007ffe3e014800 RDI: 0000000000000003
RBP: 00007ffe3e014800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000
R13: 000000001a0a93a0 R14: 0000000000000001 R15: ffffffffffffffff
</TASK>
[akpm@linux-foundation.org: implement debug_show_blocker() in C rather than in CPP]
Link: https://lkml.kernel.org/r/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com
Link: https://lkml.kernel.org/r/174046695384.2194069.16796289525958195643.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Lance Yang <ioworker0@gmail.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tomasz Figa <tfiga@chromium.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yongliang Gao <leonylgao@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The `movable` variable is always used when `CONFIG_TRANSPARENT_HUGEPAGE`
is enabled, so the `__maybe_unused` attribute is not necessary. This
patch removes it and keeps the variable declaration within the `#ifdef`
block for better clarity.
Link: https://lkml.kernel.org/r/20250319091726.401158-1-liuyerd@163.com
Signed-off-by: Liu Ye<liuye@kylinos.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Kernel test robot reports the following crash on 32-bit system with
HIGHMEM and DEBUG_VIRTUAL:
[ 0.056128][ T0] kernel BUG at arch/x86/mm/physaddr.c:77!
PANIC: early exception 0x06 IP 60:c116539d error 0 cr2 0x0
[ 0.056916][ T0] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.14.0-rc4-00010-ga4dbe5c71817 #1
[ 0.057570][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 0.058299][ T0] EIP: __phys_addr (arch/x86/mm/physaddr.c:77)
[ 0.058633][ T0] Code: 00 74 33 89 f0 e8 d3 8b 2e 00 89 c3 0f b6 d0 b8 58 bb 4b c5 31 c9 6a 00 e8 70 f5 15 00 83 c4 04 84 db 74 25 ff 05 78 de 5d c5 <0f> 0b b8 c8 91 ea c4 e8 e7 6e ea ff b8 58 bb 4b c5 31 d2 31 c9 6a
All code
[ 0.060017][ T0] EAX: 00000000 EBX: c61f7001 ECX: 00000000 EDX: 00000000
[ 0.060519][ T0] ESI: c61f7000 EDI: 061f7000 EBP: c4e31f04 ESP: c61f7000
[ 0.061016][ T0] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: cff4 EFLAGS: 00210002
[ 0.061560][ T0] CR0: 80050033 CR2: 00000000 CR3: 059fc000 CR4: 00000090
[ 0.062060][ T0] Call Trace:
[ 0.062288][ T0] ? show_regs (arch/x86/kernel/dumpstack.c:478)
[ 0.062588][ T0] ? early_fixup_exception (arch/x86/include/asm/nospec-branch.h:595)
[ 0.062968][ T0] ? early_idt_handler_common (arch/x86/kernel/head_32.S:352)
[ 0.063360][ T0] ? __phys_addr (arch/x86/mm/physaddr.c:77)
[ 0.063677][ T0] ? one_page_table_init (arch/x86/mm/init_32.c:100)
[ 0.064037][ T0] ? page_table_range_init (arch/x86/mm/init_32.c:227)
[ 0.064411][ T0] ? permanent_kmaps_init (include/linux/pgtable.h:191 include/linux/pgtable.h:196 arch/x86/mm/init_32.c:395)
[ 0.064814][ T0] ? paging_init (arch/x86/mm/init_32.c:677)
[ 0.065118][ T0] ? native_pagetable_init (arch/x86/mm/init_32.c:481)
[ 0.065503][ T0] ? setup_arch (arch/x86/kernel/setup.c:1131)
[ 0.065819][ T0] ? start_kernel (include/linux/jump_label.h:267 init/main.c:920)
[ 0.066143][ T0] ? i386_start_kernel (arch/x86/kernel/head32.c:79)
[ 0.066501][ T0] ? startup_32_smp (arch/x86/kernel/head_32.S:292)
The crash happens because commit e120d1bc12da ("arch, mm: set high_memory
in free_area_init()") moved initialization of high_memory after
__vmalloc_start_set and with high_memory still set to 0 any address passes
is_vmalloc_addr() check.
Restore early initialization of high_memory on 32-bit systems in
initmem_init().
Link: https://lkml.kernel.org/r/20250319122337.1538924-1-rppt@kernel.org
Fixes: e120d1bc12da ("arch, mm: set high_memory in free_area_init()")
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202503191442.112e954f-lkp@intel.com
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Syzkaller reports a bug as follows:
Injecting memory failure for pfn 0x18b00e at process virtual address 0x20ffd000
Memory failure: 0x18b00e: dirty swapcache page still referenced by 2 users
Memory failure: 0x18b00e: recovery action for dirty swapcache page: Failed
page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x20ffd pfn:0x18b00e
memcg:ffff0000dd6d9000
anon flags: 0x5ffffe00482011(locked|dirty|arch_1|swapbacked|hwpoison|node=0|zone=2|lastcpupid=0xfffff)
raw: 005ffffe00482011 dead000000000100 dead000000000122 ffff0000e232a7c9
raw: 0000000000020ffd 0000000000000000 00000002ffffffff ffff0000dd6d9000
page dumped because: VM_BUG_ON_FOLIO(!folio_test_uptodate(folio))
------------[ cut here ]------------
kernel BUG at mm/swap_state.c:184!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
Modules linked in:
CPU: 0 PID: 60 Comm: kswapd0 Not tainted 6.6.0-gcb097e7de84e #3
Hardware name: linux,dummy-virt (DT)
pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : add_to_swap+0xbc/0x158
lr : add_to_swap+0xbc/0x158
sp : ffff800087f37340
x29: ffff800087f37340 x28: fffffc00052c0380 x27: ffff800087f37780
x26: ffff800087f37490 x25: ffff800087f37c78 x24: ffff800087f377a0
x23: ffff800087f37c50 x22: 0000000000000000 x21: fffffc00052c03b4
x20: 0000000000000000 x19: fffffc00052c0380 x18: 0000000000000000
x17: 296f696c6f662865 x16: 7461646f7470755f x15: 747365745f6f696c
x14: 6f6621284f494c4f x13: 0000000000000001 x12: ffff600036d8b97b
x11: 1fffe00036d8b97a x10: ffff600036d8b97a x9 : dfff800000000000
x8 : 00009fffc9274686 x7 : ffff0001b6c5cbd3 x6 : 0000000000000001
x5 : ffff0000c25896c0 x4 : 0000000000000000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : ffff0000c25896c0 x0 : 0000000000000000
Call trace:
add_to_swap+0xbc/0x158
shrink_folio_list+0x12ac/0x2648
shrink_inactive_list+0x318/0x948
shrink_lruvec+0x450/0x720
shrink_node_memcgs+0x280/0x4a8
shrink_node+0x128/0x978
balance_pgdat+0x4f0/0xb20
kswapd+0x228/0x438
kthread+0x214/0x230
ret_from_fork+0x10/0x20
I can reproduce this issue with the following steps:
1) When a dirty swapcache page is isolated by reclaim process and the
page isn't locked, inject memory failure for the page.
me_swapcache_dirty() clears uptodate flag and tries to delete from lru,
but fails. Reclaim process will put the hwpoisoned page back to lru.
2) The process that maps the hwpoisoned page exits, the page is deleted
the page will never be freed and will be in the lru forever.
3) If we trigger a reclaim again and tries to reclaim the page,
add_to_swap() will trigger VM_BUG_ON_FOLIO due to the uptodate flag is
cleared.
To fix it, skip the hwpoisoned page in shrink_folio_list(). Besides, the
hwpoison folio may not be unmapped by hwpoison_user_mappings() yet, unmap
it in shrink_folio_list(), otherwise the folio will fail to be unmaped by
hwpoison_user_mappings() since the folio isn't in lru list.
Link: https://lkml.kernel.org/r/20250318083939.987651-3-tujinjiang@huawei.com
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: <stable@vger,kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/vmscan: don't try to reclaim hwpoison folio".
Fix a bug during memory reclaim if folio is hwpoisoned.
This patch (of 2):
Introduce helper folio_contain_hwpoisoned_page() to check if the entire
folio is hwpoisoned or it contains hwpoisoned pages.
Link: https://lkml.kernel.org/r/20250318083939.987651-1-tujinjiang@huawei.com
Link: https://lkml.kernel.org/r/20250318083939.987651-2-tujinjiang@huawei.com
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: <stable@vger,kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The commit 15ff4d409e1a ("mm/memcontrol: add per-memcg pgpgin/pswpin
counter") introduced the pswpin and pswpout items in the memory.stat of
cgroup v2. Therefore, update them accordingly in the cgroup-v2
documentation.
Link: https://lkml.kernel.org/r/20250318075833.90615-3-jiahao.kernel@gmail.com
Fixes: 15ff4d409e1a ("mm/memcontrol: add per-memcg pgpgin/pswpin counter")
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Adding Proactive Memory Reclaim Statistics".
These two patches are related to proactive memory reclaim.
Patch 1 Split proactive reclaim statistics from direct reclaim counters
and introduces new counters: pgsteal_proactive, pgdemote_proactive,
and pgscan_proactive.
Patch 2 Adds pswpin and pswpout items to the cgroup-v2 documentation.
This patch (of 2):
In proactive memory reclaim scenarios, it is necessary to accurately track
proactive reclaim statistics to dynamically adjust the frequency and
amount of memory being reclaimed proactively. Currently, proactive
reclaim is included in direct reclaim statistics, which can make these
direct reclaim statistics misleading.
Therefore, separate proactive reclaim memory from the direct reclaim
counters by introducing new counters: pgsteal_proactive,
pgdemote_proactive, and pgscan_proactive, to avoid confusion with direct
reclaim.
Link: https://lkml.kernel.org/r/20250318075833.90615-1-jiahao.kernel@gmail.com
Link: https://lkml.kernel.org/r/20250318075833.90615-2-jiahao.kernel@gmail.com
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
create_pagecache_thp_and_fd() was previously writing a file sized at twice
the PMD size by making a per-byte write syscall. This was quite slow when
the PMD size is 4M, but completely intolerable for 32M (PMD size for
arm64's 16K page size), and 512M (PMD size for arm64's 64K page size).
The byte pattern has a 256 byte period, so let's create a 1K buffer and
fill it with exactly 4 periods. Then we can write the buffer as many
times as is required to fill the file. This makes things much more
tolerable.
The test now passes for 16K page size. It still fails for 64K page size
because MAX_PAGECACHE_ORDER is too small for 512M folio size (I think).
Link: https://lkml.kernel.org/r/20250318174343.243631-3-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Rafael Aquini <raquini@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
uffd-unit-tests uses a memory area with a fixed 32M size. Then it
calculates the number of pages by dividing by page_size, which itself is
either the base page size or the PMD huge page size depending on the test
config. For the latter, we end up with nr_pages=1 for arm64 16K base
pages, and nr_pages=0 for 64K base pages. This doesn't end well.
So let's make the 32M size a floor and also ensure that we have at least 2
pages given the PMD size. With this change, the tests pass on arm64 64K
base page size configuration.
Link: https://lkml.kernel.org/r/20250318174343.243631-2-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Rafael Aquini <raquini@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Document availability and meaning of "active" DAMOS filter type on design
document. Since introduction of the type requires no additional user ABI,
usage and ABI document need no update.
Link: https://lkml.kernel.org/r/20250318183029.2062917-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: introduce DAMOS filter type for active pages".
The memory reclaim algorithm categorizes pages into active and inactive
lists, separately for file and anon pages. The system's performance
relies heavily on the (relative and absolute) accuracy of this
categorization.
This patch series add a new DAMOS filter for pages' activeness, giving us
visibility into the access frequency of the pages on each list. This
insight can help us diagnose issues with the active-inactive balancing
dynamics, and make decisions to optimize reclaim efficiency and memory
utilization.
For instance, we might decide to enable DAMON_LRU_SORT, if we find that
there are pages on the active list that are infrequently accessed, or less
frequently accessed than pages on the inactive list.
This patch (of 2):
Implement a DAMOS filter type for active pages on DAMON kernel API, and
add support of it from the physical address space DAMON operations set
(paddr).
Link: https://lkml.kernel.org/r/20250318183029.2062917-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20250318183029.2062917-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Prior to commit 38607c62b34b ("fs/dax: properly refcount fs dax pages")
dax_associate_entry() and dax_disassociate_entry() would implicitly skip
zero and empty dax entries using the for_each_mapped_pfn() macro. The use
of compound ZONE_DEVICE folios removed the need for this macro and so it
was removed, leading dax_folio_put() to be called on zero pages.
This lead to the below warning. To fix this explicitly skip zero and
empty entries in dax_associate/disassociate_entry().
[ 27.536963] ------------[ cut here ]------------
[ 27.537674] WARNING: CPU: 11 PID: 874 at fs/dax.c:415 dax_folio_put.isra.0+0x10d/0x170
[ 27.538844] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
[ 27.539732] CPU: 11 UID: 0 PID: 874 Comm: ctl_prefault Tainted: G W 6.14.0-rc2+ #1104
[ 27.541093] Tainted: [W]=WARN
[ 27.541549] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/204
[ 27.543197] RIP: 0010:dax_folio_put.isra.0+0x10d/0x170
[ 27.543970] Code: 20 48 85 c0 0f 84 29 ff ff ff 48 83 e8 01 48 89 47 20 0f 84 1b ff ff ff 48 83 c4 10 5b 5d 41 5c c3 cc cc4
[ 27.546723] RSP: 0000:ffff961e4102fae0 EFLAGS: 00010002
[ 27.547505] RAX: 0000000000000001 RBX: ffffc9cce4e18000 RCX: 0000000000000009
[ 27.548564] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8a2a7badca40
[ 27.549630] RBP: ffffc9cce4e18000 R08: 0000000000009ffb R09: 00000000ffffdfff
[ 27.550691] R10: 00000000ffffdfff R11: ffffffffa4e823a0 R12: 0000000000000000
[ 27.551748] R13: 0000000000000000 R14: 0000000010f10005 R15: 0000000000000004
[ 27.552819] FS: 00007f5f539d74c0(0000) GS:ffff8a2a7bac0000(0000) knlGS:0000000000000000
[ 27.554015] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 27.554873] CR2: 00007f5f52e00000 CR3: 0000000909340000 CR4: 00000000000006f0
[ 27.555938] Call Trace:
[ 27.556318] <TASK>
[ 27.556650] ? __warn+0x91/0x190
[ 27.557146] ? dax_folio_put.isra.0+0x10d/0x170
[ 27.557824] ? report_bug+0x164/0x190
[ 27.558378] ? handle_bug+0x54/0x90
[ 27.558898] ? exc_invalid_op+0x17/0x70
[ 27.559489] ? asm_exc_invalid_op+0x1a/0x20
[ 27.560125] ? dax_folio_put.isra.0+0x10d/0x170
[ 27.560808] dax_insert_entry+0x1e1/0x420
[ 27.561419] dax_fault_iter+0x252/0x860
[ 27.561995] dax_iomap_pmd_fault+0x23c/0x4a0
[ 27.562651] ext4_dax_huge_fault+0x1e2/0x450
[ 27.563296] __handle_mm_fault+0x6c8/0x12b0
[ 27.563920] ? do_user_addr_fault+0x1ca/0x670
[ 27.564577] ? lock_vma_under_rcu+0x178/0x3b0
[ 27.565235] handle_mm_fault+0xe5/0x290
[ 27.565816] do_user_addr_fault+0x208/0x670
[ 27.566446] exc_page_fault+0x6d/0x230
[ 27.567008] asm_exc_page_fault+0x26/0x30
[ 27.567610] RIP: 0033:0x7f5f543bcb4f
[ 27.568152] Code: 45 f0 48 8b 45 f0 48 8b 4d f8 48 03 41 18 48 89 45 e8 48 8b 45 f0 48 3b 45 e8 0f 83 97 00 00 00 48 8b 458
[ 27.570895] RSP: 002b:00007ffc2d774460 EFLAGS: 00010287
[ 27.571672] RAX: 00007f5f52e00000 RBX: 0000000000200000 RCX: 000055760153fc00
[ 27.572731] RDX: 0000000000000000 RSI: 0000557601542a20 RDI: 000055760153fc00
[ 27.573787] RBP: 00007ffc2d774460 R08: 0000000000000000 R09: 0000000000000073
[ 27.574840] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffc2d77534b
[ 27.575897] R13: 00007ffc2d774aa0 R14: 0000000000800000 R15: 0000000000800000
[ 27.576961] </TASK>
[ 27.577301] irq event stamp: 13394
[ 27.577810] hardirqs last enabled at (13393): [<ffffffffa3485780>] flush_tlb_mm_range+0x1c0/0x220
[ 27.579138] hardirqs last disabled at (13394): [<ffffffffa450d0c7>] _raw_spin_lock_irq+0x47/0x50
[ 27.580428] softirqs last enabled at (12530): [<ffffffffa433941a>] xs_tcp_send_request+0x22a/0x2e0
[ 27.581762] softirqs last disabled at (12528): [<ffffffffa40a60fd>] release_sock+0x1d/0xb0
[ 27.582986] ---[ end trace 0000000000000000 ]---
Link: https://lkml.kernel.org/r/20250319013301.369822-1-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202503102229.122fbd6c-lkp@intel.com
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit dcdfdd40fa82 ("mm: Add support for unaccepted memory") added a
entry to meminfo but did not document it in the proc.rst file.
This counter tracks the amount of "Unaccepted" guest memory for some
Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP.
Add the missing entry in the documentation.
Link: https://lkml.kernel.org/r/20250317230403.79632-1-npache@redhat.com
Signed-off-by: Nico Pache <npache@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
As discussed here:
https://lore.kernel.org/lkml/Z9RRkL1hom48z3Tt@google.com/
This code could benefit from some more commentary.
To avoid needing to comment the same thing in multiple places (I guess
more of these SKIPs will need to be added over time, for now I am only
like 20% of the way through Project Run run_vmtests.sh Successfully), add
a dummy "skip tests for this specific reason" function that basically just
serves as a hook to hang comments on.
Link: https://lkml.kernel.org/r/20250317-9pfs-comments-v1-1-9ac96043e146@google.com
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Replace __vmalloc_node_range() by __vmalloc_node(). The last variant
requires less parameters and it uses exactly the same arguments which are
partly now hidden inside __vmalloc_node().
This change does not change any functionality. It makes the code a bit
simpler.
Link: https://lkml.kernel.org/r/20250317163614.166502-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Briefly describe what zones are and the fields of struct zone.
Link: https://lkml.kernel.org/r/20250315211317.27612-1-jiwen7.qi@gmail.com
Signed-off-by: Jiwen Qi <jiwen7.qi@gmail.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Update the NR_BALLOON_PAGES counter when pages are added to or removed
from the Xen balloon.
Link: https://lkml.kernel.org/r/20250314213757.244258-5-npache@redhat.com
Signed-off-by: Nico Pache <npache@redhat.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Alexander Atanasov <alexander.atanasov@virtuozzo.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Juegren Gross <jgross@suse.com>
Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Update the NR_BALLOON_PAGES counter when pages are added to or removed
from the Hyper-V balloon.
Link: https://lkml.kernel.org/r/20250314213757.244258-4-npache@redhat.com
Signed-off-by: Nico Pache <npache@redhat.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Cc: Alexander Atanasov <alexander.atanasov@virtuozzo.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Juegren Gross <jgross@suse.com>
Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Update the NR_BALLOON_PAGES counter when pages are added or removed using
the balloon compaction interface.
The virtio, Vmware, and pseries-cmm balloon drivers utilize the
balloon_compaction interface to allocate and free balloon pages. Other
balloon drivers will have to maintain this counter manually.
Link: https://lkml.kernel.org/r/20250314213757.244258-3-npache@redhat.com
Signed-off-by: Nico Pache <npache@redhat.com>
Cc: Alexander Atanasov <alexander.atanasov@virtuozzo.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Juegren Gross <jgross@suse.com>
Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "track memory used by balloon drivers", v2.
This series introduces a way to track memory used by balloon drivers.
Add a NR_BALLOON_PAGES counter to track how many pages are reclaimed by
the balloon drivers. First add the accounting, then updates the balloon
drivers (virtio, Hyper-V, VMware, Pseries-cmm, and Xen) to maintain this
counter. The virtio, Vmware, and pseries-cmm balloon drivers utilize the
balloon_compaction interface to allocate and free balloon pages. Other
balloon drivers will have to maintain this counter manually.
This makes the information visible in memory reporting interfaces like
/proc/meminfo, show_mem, and OOM reporting.
This provides admins visibility into their VM balloon sizes without
requiring different virtualization tooling. Furthermore, this information
is helpful when debugging an OOM inside a VM.
This patch (of 4):
Add NR_BALLOON_PAGES counter to track memory used by balloon drivers and
expose it through /proc/meminfo and other memory reporting interfaces.
[npache@redhat.com: document Balloon Meminfo entry]
Link: https://lkml.kernel.org/r/a0315ccf-f244-460e-8643-fd7388724fe5@redhat.com
Link: https://lkml.kernel.org/r/20250314213757.244258-1-npache@redhat.com
Link: https://lkml.kernel.org/r/20250314213757.244258-2-npache@redhat.com
Signed-off-by: Nico Pache <npache@redhat.com>
Cc: Alexander Atanasov <alexander.atanasov@virtuozzo.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Juegren Gross <jgross@suse.com>
Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This use of folios is misleading because these pages are not part of
a folio. Remove an unnecessary call to page_folio(), saving 58 bytes
of text in a Debian kernel build.
Link: https://lkml.kernel.org/r/20250314133617.138071-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There's no need to check which kind of pointer is in the memcg_data field,
all we actually care about is whether it's zero or not. Saves 70 bytes in
workingset_activation() with the Debian config.
Link: https://lkml.kernel.org/r/20250314133617.138071-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We know that the passed in page is not part of a folio (it's a plain page
allocated with GFP_ACCOUNT), so we should get rid of the misleading
references to folios.
Introduce page_objcg() and page_set_objcg() helpers to make things more
clear.
Link: https://lkml.kernel.org/r/20250314133617.138071-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The last argument to split_page_memcg() is now always 0, so remove it,
effectively reverting commit b8791381d7ed.
Link: https://lkml.kernel.org/r/20250314133617.138071-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Minor memcg cleanups & prep for memdescs", v2.
Separate the handling of accounted folios and GFP_ACCOUNT pages for easier
to understand code. For more detail, see
https://lore.kernel.org/linux-mm/Z9LwTOudOlCGny3f@casper.infradead.org/
This patch (of 5):
Folios always use memcg_data to refer to the mem_cgroup while pages
allocated with GFP_ACCOUNT have a pointer to the obj_cgroup. Since the
caller already knows what it has, split the function into two and then we
don't need to check.
Move the assignment of split folio memcg_data to the point where we set up
the other parts of the new folio. That leaves folio_split_memcg_refs()
just handling the memcg accounting.
Link: https://lkml.kernel.org/r/20250314133617.138071-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250314133617.138071-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The do_memsw_account() is used to enable or disable legacy memory+swap
accounting in memory cgroup. However with disabled CONFIG_MEMCG_V1, we
don't need to keep checking it. So, let's always return false for
!CONFIG_MEMCG_V1 configs.
Before the patch:
$ size mm/memcontrol.o
text data bss dec hex filename
49928 10736 4172 64836 fd44 mm/memcontrol.o
After the patch:
$ size mm/memcontrol.o
text data bss dec hex filename
49430 10480 4172 64082 fa52 mm/memcontrol.o
Link: https://lkml.kernel.org/r/20250312222552.3284173-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We never charge the page counters of root memcg, so there is no need to
put root memcg in the memcg stock. At the moment, refill_stock() can be
called from try_charge_memcg(), obj_cgroup_uncharge_pages() and
mem_cgroup_uncharge_skmem().
The try_charge_memcg() and mem_cgroup_uncharge_skmem() are never called
with root memcg, so those are fine. However obj_cgroup_uncharge_pages()
can potentially call refill_stock() with root memcg if the objcg object
has been reparented over to the root memcg. Let's just avoid
refill_stock() from obj_cgroup_uncharge_pages() for root memcg.
Link: https://lkml.kernel.org/r/20250313054812.2185900-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhockoc@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, init_reserved_page()
function performs initialization of a struct page that would have been
deferred normally.
Rename it to init_deferred_page() to better reflect what the function does.
Link: https://lkml.kernel.org/r/20250225083017.567649-3-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
__init_reserved_page_zone() function finds the zone for pfn and nid and
performs initialization of a struct page with that zone and nid. There is
nothing in that function about reserved pages and it is misnamed.
Rename it to __init_page_from_nid() to better reflect what the function
does.
Link: https://lkml.kernel.org/r/20250225083017.567649-2-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
For different CMAs, concurrent allocation of CMA memory ideally should not
require synchronization using locks. Currently, a global cma_mutex lock
is employed to synchronize all CMA allocations, which can impact the
performance of concurrent allocations across different CMAs.
To test the performance impact, follow these steps:
1. Boot the kernel with the command line argument hugetlb_cma=30G to
allocate a 30GB CMA area specifically for huge page allocations. (note:
on my machine, which has 3 nodes, each node is initialized with 10G of
CMA)
2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
count=30 to fully utilize the CMA area by writing zeroes to a file in
/dev/shm.
3. Open three terminals and execute the following commands simultaneously:
(Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB
pages] of CMA memory.)
On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
We attempt to allocate pages through the CMA debug interface and use the
time command to measure the duration of each allocation.
Performance comparison:
Without this patch With this patch
Terminal1 ~7s ~7s
Terminal2 ~14s ~8s
Terminal3 ~21s ~7s
To solve problem above, we could use per-CMA locks to improve concurrent
allocation performance. This would allow each CMA to be managed
independently, reducing the need for a global lock and thus improving
scalability and performance.
Link: https://lkml.kernel.org/r/1739152566-744-1-git-send-email-yangge1116@126.com
Signed-off-by: Ge Yang <yangge1116@126.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aisheng Dong <aisheng.dong@nxp.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Prepare the emc2305 driver to use configuration from Device Tree nodes.
Switch to devm_thermal_of_cooling_device_register to simplify the
cleanup procedure, allowing the removal of emc2305_unset_tz and
emc2305_remove, which are no longer needed.
Signed-off-by: Florin Leotescu <florin.leotescu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://lore.kernel.org/r/20250321143308.4008623-4-florin.leotescu@oss.nxp.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
Introduce OF support for Microchip emc2305 pwm fan controller.
Signed-off-by: Florin Leotescu <florin.leotescu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://lore.kernel.org/r/20250321143308.4008623-3-florin.leotescu@oss.nxp.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
Introduce yaml schema for Microchip emc2305 pwm fan controller.
Signed-off-by: Florin Leotescu <florin.leotescu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/r/20250321143308.4008623-2-florin.leotescu@oss.nxp.com
[groeck: Fixed comment line length, added 'maxItems: 1'
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
Add pds_fwctl to the driver and fwctl documentation pages.
Link: https://patch.msgid.link/r/20250320194412.67983-7-shannon.nelson@amd.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|