Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
creating a pte which addresses the first page in a folio and reduces
the amount of plumbing which architecture must implement to provide
this.
- "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
largely unrelated folio infrastructure changes which clean things up
and better prepare us for future work.
- "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
Price adds early-init code to prevent x86 from leaving physical
memory unused when physical address regions are not aligned to memory
block size.
- "mm/compaction: allow more aggressive proactive compaction" from
Michal Clapinski provides some tuning of the (sadly, hard-coded (more
sadly, not auto-tuned)) thresholds for our invokation of proactive
compaction. In a simple test case, the reduction of a guest VM's
memory consumption was dramatic.
- "Minor cleanups and improvements to swap freeing code" from Kemeng
Shi provides some code cleaups and a small efficiency improvement to
this part of our swap handling code.
- "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
adds the ability for a ptracer to modify syscalls arguments. At this
time we can alter only "system call information that are used by
strace system call tampering, namely, syscall number, syscall
arguments, and syscall return value.
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
against /proc/pid/pagemap. This permits CRIU to more efficiently get
at the info about guard regions.
- "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
implements that fix. No runtime effect is expected because
validate_page_before_insert() happens to fix up this error.
- "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
Hildenbrand basically brings uprobe text poking into the current
decade. Remove a bunch of hand-rolled implementation in favor of
using more current facilities.
- "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
Khandual provides enhancements and generalizations to the pte dumping
code. This might be needed when 128-bit Page Table Descriptors are
enabled for ARM.
- "Always call constructor for kernel page tables" from Kevin Brodsky
ensures that the ctor/dtor is always called for kernel pgtables, as
it already is for user pgtables.
This permits the addition of more functionality such as "insert hooks
to protect page tables". This change does result in various
architectures performing unnecesary work, but this is fixed up where
it is anticipated to occur.
- "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
Ryhl adds plumbing to permit Rust access to core MM structures.
- "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
Stoakes takes advantage of some VMA merging opportunities which we've
been missing for 15 years.
- "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
SeongJae Park optimizes process_madvise()'s TLB flushing.
Instead of flushing each address range in the provided iovec, we
batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- "Track node vacancy to reduce worst case allocation counts" from
Sidhartha Kumar makes the maple tree smarter about its node
preallocation.
stress-ng mmap performance increased by single-digit percentages and
the amount of unnecessarily preallocated memory was dramaticelly
reduced.
- "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
a few unnecessary things which Baoquan noted when reading the code.
- ""Enhance sysfs handling for memory hotplug in weighted interleave"
from Rakie Kim "enhances the weighted interleave policy in the memory
management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug
support". Fixes things on error paths which we are unlikely to hit.
- "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
from SeongJae Park introduces new DAMOS quota goal metrics which
eliminate the manual tuning which is required when utilizing DAMON
for memory tiering.
- "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
provides cleanups and small efficiency improvements which Baoquan
found via code inspection.
- "vmscan: enforce mems_effective during demotion" from Gregory Price
changes reclaim to respect cpuset.mems_effective during demotion when
possible. because presently, reclaim explicitly ignores
cpuset.mems_effective when demoting, which may cause the cpuset
settings to violated.
This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently.
- "Clean up split_huge_pmd_locked() and remove unnecessary folio
pointers" from Gavin Guo provides minor cleanups and efficiency gains
in in the huge page splitting and migrating code.
- "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
for `struct mem_cgroup', yielding improved memory utilization.
- "add max arg to swappiness in memory.reclaim and lru_gen" from
Zhongkun He adds a new "max" argument to the "swappiness=" argument
for memory.reclaim MGLRU's lru_gen.
This directs proactive reclaim to reclaim from only anon folios
rather than file-backed folios.
- "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
first step on the path to permitting the kernel to maintain existing
VMs while replacing the host kernel via file-based kexec. At this
time only memblock's reserve_mem is preserved.
- "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
and uses a smarter way of looping over a pfn range. By skipping
ranges of invalid pfns.
- "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
when a task is pinned a single NUMA mode.
Dramatic performance benefits were seen in some real world cases.
- "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
Garg addresses a warning which occurs during memory compaction when
using JFS.
- "move all VMA allocation, freeing and duplication logic to mm" from
Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
appropriate mm/vma.c.
- "mm, swap: clean up swap cache mapping helper" from Kairui Song
provides code consolidation and cleanups related to the folio_index()
function.
- "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.
- "memcg: Fix test_memcg_min/low test failures" from Waiman Long
addresses some bogus failures which are being reported by the
test_memcontrol selftest.
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
Stoakes commences the deprecation of file_operations.mmap() in favor
of the new file_operations.mmap_prepare().
The latter is more restrictive and prevents drivers from messing with
things in ways which, amongst other problems, may defeat VMA merging.
- "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
the per-cpu memcg charge cache from the objcg's one.
This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- "mm/damon: minor fixups and improvements for code, tests, and
documents" from SeongJae Park is yet another batch of miscellaneous
DAMON changes. Fix and improve minor problems in code, tests and
documents.
- "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
stats to be irq safe. Another step along the way to making memcg
charging and stats updates NMI-safe, a BPF requirement.
- "Let unmap_hugepage_range() and several related functions take folio
instead of page" from Fan Ni provides folio conversions in the
hugetlb code.
* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
mm: pcp: increase pcp->free_count threshold to trigger free_high
mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
mm/hugetlb: pass folio instead of page to unmap_ref_private()
memcg: objcg stock trylock without irq disabling
memcg: no stock lock for cpu hot-unplug
memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
memcg: make count_memcg_events re-entrant safe against irqs
memcg: make mod_memcg_state re-entrant safe against irqs
memcg: move preempt disable to callers of memcg_rstat_updated
memcg: memcg_rstat_updated re-entrant safe against irqs
mm: khugepaged: decouple SHMEM and file folios' collapse
selftests/eventfd: correct test name and improve messages
alloc_tag: check mem_profiling_support in alloc_tag_init
Docs/damon: update titles and brief introductions to explain DAMOS
selftests/damon/_damon_sysfs: read tried regions directories in order
mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"Once again, the changes are dominated by cpufreq updates, but this
time the majority of them are cpufreq core changes, mostly related to
the introduction of policy locking guards and __free() usage, and
fixes related to boost handling.
Still, there is also a significant update of the intel_pstate driver
making it register an energy model when running on a hybrid platform
which is used for enabling energy-aware scheduling (EAS) if the driver
operates in the passive mode (and schedutil is used as the cpufreq
governor for all CPUs which is the passive mode default).
There are some amd-pstate driver updates too, for a good measure,
including the "Requested CPU Min frequency" BIOS option support and
new online/offline callbacks.
In the cpuidle space, the most significant change is the addition of a
C1 demotion on/off sysfs knob to intel_idle which should help some
users to configure their systems more precisely. There is also the
conversion of the PSCI cpuidle driver to a faux device one and there
are two small updates of cpuidle governors.
Device power management is also modified quite a bit, especially the
handling of devices with asynchronous suspend and resume enabled
during system transitions. They are now going to be handled more
asynchronously during suspend transitions and somewhat less
aggressively during resume transitions.
Apart from the above, the operating performance points (OPP) library
is now going to use mutex locking guards and scope-based cleanup
helpers and there is the usual bunch of assorted fixes and code
cleanups.
Specifics:
- Fix potential division-by-zero error in em_compute_costs() (Yaxiong
Tian)
- Fix typos in energy model documentation and example driver code
(Moon Hee Lee, Atul Kumar Pant)
- Rearrange the energy model management code and add a new function
for adjusting a CPU energy model after adjusting the capacity of
the given CPU to it (Rafael Wysocki)
- Refactor cpufreq_online(), add and use cpufreq policy locking
guards, use __free() in policy reference counting, and clean up
core cpufreq code on top of that (Rafael Wysocki)
- Fix boost handling on CPU suspend/resume and sysfs updates (Viresh
Kumar)
- Fix des_perf clamping with max_perf in amd_pstate_update()
(Dhananjay Ugwekar)
- Add offline, online and suspend callbacks to the amd-pstate driver,
rename and use the existing amd_pstate_epp callbacks in it
(Dhananjay Ugwekar)
- Add support for the "Requested CPU Min frequency" BIOS option to
the amd-pstate driver (Dhananjay Ugwekar)
- Reset amd-pstate driver mode after running selftests (Swapnil
Sapkal)
- Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan
Chancellor)
- Add helper for governor checks to the schedutil cpufreq governor
and move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki)
- Populate the cpu_capacity sysfs entries from the intel_pstate
driver after registering asym capacity support (Ricardo Neri)
- Add support for enabling Energy-aware scheduling (EAS) to the
intel_pstate driver when operating in the passive mode on a hybrid
platform (Rafael Wysocki)
- Drop redundant cpus_read_lock() from store_local_boost() in the
cpufreq core (Seyediman Seyedarab)
- Replace sscanf() with kstrtouint() in the cpufreq code and use a
symbol instead of a raw number in it (Bowen Yu)
- Add support for autonomous CPU performance state selection to the
CPPC cpufreq driver (Lifeng Zheng)
- OPP: Add dev_pm_opp_set_level() (Praveen Talari)
- Introduce scope-based cleanup headers and mutex locking guards in
OPP core (Viresh Kumar)
- Switch OPP to use kmemdup_array() (Zhang Enpei)
- Optimize bucket assignment when next_timer_ns equals KTIME_MAX in
the menu cpuidle governor (Zhongqiu Han)
- Convert the cpuidle PSCI driver to a faux device one (Sudeep Holla)
- Add C1 demotion on/off sysfs knob to the intel_idle driver (Artem
Bityutskiy)
- Fix typos in two comments in the teo cpuidle governor (Atul Kumar
Pant)
- Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja
Kalla)
- Move debug runtime PM attributes to runtime_attrs[] (Rafael
Wysocki)
- Add new devm_ functions for enabling runtime PM and runtime PM
reference counting (Bence Csókás)
- Remove size arguments from strscpy() calls in the hibernation core
code (Thorsten Blum)
- Adjust the handling of devices with asynchronous suspend enabled
during system suspend and resume to start resuming them immediately
after resuming their parents and to start suspending such a device
immediately after suspending its first child (Rafael Wysocki)
- Adjust messages printed during tasks freezing to avoid using
pr_cont() (Andrew Sayers, Paul Menzel)
- Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan
Zhang)
- Add missing wakeup source attribute relax_count to sysfs and remove
the space character at the end ofi the string produced by
pm_show_wakelocks() (Zijun Hu)
- Add configurable pm_test delay for hibernation (Zihuan Zhang)
- Disable asynchronous suspend in ucsi_ccg_probe() to prevent the
cypd4226 device on Tegra boards from suspending prematurely (Jon
Hunter)
- Unbreak printing PM debug messages during hibernation and clean up
some related code (Rafael Wysocki)
- Add a systemd service to run cpupower and change cpupower binding's
Makefile to use -lcpupower (John B. Wyatt IV, Francesco Poli)"
* tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (72 commits)
cpufreq: CPPC: Add support for autonomous selection
cpufreq: Update sscanf() to kstrtouint()
cpufreq: Replace magic number
OPP: switch to use kmemdup_array()
PM: freezer: Rewrite restarting tasks log to remove stray *done.*
PM: runtime: fix denying of auto suspend in pm_suspend_timer_fn()
cpufreq: drop redundant cpus_read_lock() from store_local_boost()
cpupower: do not install files to /etc/default/
cpupower: do not call systemctl at install time
cpupower: do not write DESTDIR to cpupower.service
PM: sleep: Introduce pm_sleep_transition_in_progress()
cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver()
cpufreq: intel_pstate: Document hybrid processor support
cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
cpufreq: intel_pstate: EAS support for hybrid platforms
PM: EM: Introduce em_adjust_cpu_capacity()
PM: EM: Move CPU capacity check to em_adjust_new_capacity()
PM: EM: Documentation: Fix typos in example driver code
cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas()
PM: sleep: Introduce pm_suspend_in_progress()
...
|
|
Merge updates related to system sleep handling and runtime PM for 6.16-rc1:
- Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja
Kalla).
- Move debug runtime PM attributes to runtime_attrs[] (Rafael Wysocki).
- Add new devm_ functions for enabling runtime PM and runtime PM
reference counting (Bence Csókás).
- Remove size arguments from strscpy() calls in the hibernation core
code (Thorsten Blum).
- Adjust the handling of devices with asynchronous suspend enabled
during system suspend and resume to start resuming them immediately
after resuming their parents and to start suspending such a device
immediately after suspending its first child (Rafael Wysocki).
- Adjust messages printed during tasks freezing to avoid using
pr_cont() (Andrew Sayers, Paul Menzel).
- Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan
Zhang).
- Add missing wakeup source attribute relax_count to sysfs and
remove the space character at the end ofi the string produced by
pm_show_wakelocks() (Zijun Hu).
- Add configurable pm_test delay for hibernation (Zihuan Zhang).
- Disable asynchronous suspend in ucsi_ccg_probe() to prevent the
cypd4226 device on Tegra boards from suspending prematurely (Jon
Hunter).
- Unbreak printing PM debug messages during hibernation and clean up
some related code (Rafael Wysocki).
* pm-runtime:
PM: runtime: fix denying of auto suspend in pm_suspend_timer_fn()
PM: sysfs: Move debug runtime PM attributes to runtime_attrs[]
PM: runtime: Add new devm functions
* pm-sleep:
PM: freezer: Rewrite restarting tasks log to remove stray *done.*
PM: sleep: Introduce pm_sleep_transition_in_progress()
PM: sleep: Introduce pm_suspend_in_progress()
PM: sleep: Print PM debug messages during hibernation
ucsi_ccg: Disable async suspend in ucsi_ccg_probe()
PM: hibernate: add configurable delay for pm_test
PM: wakeup: Delete space in the end of string shown by pm_show_wakelocks()
PM: wakeup: Add missing wakeup source attribute relax_count
PM: sleep: Remove unnecessary !!
PM: sleep: Use two lines for "Restarting..." / "done" messages
PM: sleep: Make suspend of devices more asynchronous
PM: sleep: Suspend async parents after suspending children
PM: sleep: Resume children after resuming the parent
PM: hibernate: Remove size arguments when calling strscpy()
|
|
Pull block updates from Jens Axboe:
- ublk updates:
- Add support for updating the size of a ublk instance
- Zero-copy improvements
- Auto-registering of buffers for zero-copy
- Series simplifying and improving GET_DATA and request lookup
- Series adding quiesce support
- Lots of selftests additions
- Various cleanups
- NVMe updates via Christoph:
- add per-node DMA pools and use them for PRP/SGL allocations
(Caleb Sander Mateos, Keith Busch)
- nvme-fcloop refcounting fixes (Daniel Wagner)
- support delayed removal of the multipath node and optionally
support the multipath node for private namespaces (Nilay Shroff)
- support shared CQs in the PCI endpoint target code (Wilfred
Mallawa)
- support admin-queue only authentication (Hannes Reinecke)
- use the crc32c library instead of the crypto API (Eric Biggers)
- misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
Reinecke, Leon Romanovsky, Gustavo A. R. Silva)
- MD updates via Yu:
- Fix that normal IO can be starved by sync IO, found by mkfs on
newly created large raid5, with some clean up patches for bdev
inflight counters
- Clean up brd, getting rid of atomic kmaps and bvec poking
- Add loop driver specifically for zoned IO testing
- Eliminate blk-rq-qos calls with a static key, if not enabled
- Improve hctx locking for when a plug has IO for multiple queues
pending
- Remove block layer bouncing support, which in turn means we can
remove the per-node bounce stat as well
- Improve blk-throttle support
- Improve delay support for blk-throttle
- Improve brd discard support
- Unify IO scheduler switching. This should also fix a bunch of lockdep
warnings we've been seeing, after enabling lockdep support for queue
freezing/unfreezeing
- Add support for block write streams via FDP (flexible data placement)
on NVMe
- Add a bunch of block helpers, facilitating the removal of a bunch of
duplicated boilerplate code
- Remove obsolete BLK_MQ pci and virtio Kconfig options
- Add atomic/untorn write support to blktrace
- Various little cleanups and fixes
* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
selftests: ublk: add test for UBLK_F_QUIESCE
ublk: add feature UBLK_F_QUIESCE
selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
traceevent/block: Add REQ_ATOMIC flag to block trace events
ublk: run auto buf unregisgering in same io_ring_ctx with registering
io_uring: add helper io_uring_cmd_ctx_handle()
ublk: remove io argument from ublk_auto_buf_reg_fallback()
ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
selftests: ublk: support UBLK_F_AUTO_BUF_REG
ublk: support UBLK_AUTO_BUF_REG_FALLBACK
ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
ublk: prepare for supporting to register request buffer automatically
ublk: convert to refcount_t
selftests: ublk: make IO & device removal test more stressful
nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
nvme: introduce multipath_always_on module param
nvme-multipath: introduce delayed removal of the multipath head node
nvme-pci: derive and better document max segments limits
nvme-pci: use struct_size for allocation struct nvme_dev
...
|
|
`pr_cont()` unfortunately does not work here, as other parts of the
Linux kernel log between the two log lines:
[18445.295056] r8152-cfgselector 4-1.1.3: USB disconnect, device number 5
[18445.295112] OOM killer enabled.
[18445.295115] Restarting tasks ...
[18445.295185] usb 3-1: USB disconnect, device number 2
[18445.295193] usb 3-1.1: USB disconnect, device number 3
[18445.296262] usb 3-1.5: USB disconnect, device number 4
[18445.297017] done.
[18445.297029] random: crng reseeded on system resumption
`pr_cont()` also uses the default log level, normally warning, if the
corresponding log line is interrupted.
Therefore, replace the `pr_cont()`, and explicitly log it as a separate
line with log level info:
Restarting tasks: Starting
[…]
Restarting tasks: Done
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://patch.msgid.link/20250511174648.950430-1-pmenzel@molgen.mpg.de
[ rjw: Rebase on top of an earlier analogous change ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The "suspend in progress" check in device_wakeup_enable() does not
cover hibernation, but arguably it should do that, so introduce
pm_sleep_transition_in_progress() covering transitions during both
system suspend and hibernation to use in there and use it also in
pm_debug_messages_should_print().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/7820474.EvYhyI6sBW@rjwysocki.net
[ rjw: Move the new function definition under CONFIG_PM_SLEEP ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Add a function for updating the Energy Model for a CPU after its
capacity has changed, which subsequently will be used by the
intel_pstate driver.
An EM_PERF_DOMAIN_ARTIFICIAL check is added to em_recalc_and_update()
to prevent it from calling em_compute_costs() for an "artificial" perf
domain with a NULL cb parameter which would cause it to crash.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/3637203.iIbC2pHGDl@rjwysocki.net
|
|
Move the check of the CPU capacity currently stored in the energy model
against the arch_scale_cpu_capacity() value to em_adjust_new_capacity()
so it will be done regardless of where the latter is called from.
This will be useful when a new em_adjust_new_capacity() caller is added
subsequently.
While at it, move the pd local variable declaration in
em_check_capacity_update() into the loop in which it is used.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/7810787.EvYhyI6sBW@rjwysocki.net
|
|
Introduce pm_suspend_in_progress() to be used for checking if a system-
wide suspend or resume transition is in progress, instead of comparing
pm_suspend_target_state directly to PM_SUSPEND_ON, and use it where
applicable.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/2020901.PYKUYFuaPT@rjwysocki.net
|
|
Commit cdb8c100d8a4 ("include/linux/suspend.h: Only show pm_pr_dbg
messages at suspend/resume") caused PM debug messages to only be
printed during system-wide suspend and resume in progress, but it
forgot about hibernation.
Address this by adding a check for hibernation in progress to
pm_debug_messages_should_print().
Fixes: cdb8c100d8a4 ("include/linux/suspend.h: Only show pm_pr_dbg messages at suspend/resume")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/4998903.GXAFRqVoOG@rjwysocki.net
|
|
Link: https://lkml.kernel.org/r/20250423133821.789413-5-dwmw2@infradead.org
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Ruihan Li <lrh2000@pku.edu.cn>
Cc: Will Deacon <will@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Turn the default 5 second test delay for hibernation into a
configurable module parameter, so users can determine how
long to wait in this pseudo-hibernate state before resuming
the system.
The configurable delay parameter has been added for suspend, so
add an analogous one for hibernation.
Example (wait 30 seconds);
# echo 30 > /sys/module/hibernate/parameters/pm_test_delay
# echo core > /sys/power/pm_test
Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20250507063520.419635-1-zhangzihuan@kylinos.cn
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
pm_show_wakelocks() is called to generate a string when showing
attributes /sys/power/wake_(lock|unlock), but the string ends
with an unwanted space that was added back by mistake by commit
c9d967b2ce40 ("PM: wakeup: simplify the output logic of
pm_show_wakelocks()").
Remove the unwanted space.
Fixes: c9d967b2ce40 ("PM: wakeup: simplify the output logic of pm_show_wakelocks()")
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://patch.msgid.link/20250505-fix_power-v1-1-0f7f2c2f338c@quicinc.com
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Now all the pieces are in place to actually allow the power subsystem
to freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.
We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.
What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.
We add a new sysctl know /sys/power/freeze_filesystems that will allow
userspace to freeze filesystems during suspend/hibernate. For now it
defaults to off. The thaw logic doesn't require checking whether
freezing is enabled because the power subsystem exclusively owns frozen
filesystems for the duration of suspend/hibernate and is able to skip
filesystems it doesn't need to freeze.
Also it is technically possible that filesystem
filesystem_freeze_enabled is true and power freezes the filesystems but
before freezing all processes another process disables
filesystem_freeze_enabled. If power were to place the filesystems_thaw()
call under filesystems_freeze_enabled it would fail to thaw the
fileystems it frozw. The exclusive holder mechanism makes it possible to
iterate through the list without any concern making sure that no
filesystems are left frozen.
Link: https://lore.kernel.org/r/20250402-work-freeze-v2-3-6719a97b52ac@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Split hib_submit_io into a sync and async version. The sync version is
a small wrapper around bdev_rw_virt which implements all the logic to
add a kernel direct mapping range to a bio and synchronously submits it,
while the async version is slightly simplified using the
bio_add_virt_nofail for adding the single range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When the device is of a non-CPU type, table[i].performance won't be
initialized in the previous em_init_performance(), resulting in division
by zero when calculating costs in em_compute_costs().
Since the 'cost' algorithm is only used for EAS energy efficiency
calculations and is currently not utilized by other device drivers, we
should add the _is_cpu_device(dev) check to prevent this division-by-zero
issue.
Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove division")
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/tencent_7F99ED4767C1AF7889D0D8AD50F34859CE06@qq.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Since initcall_debug is a bool variable, it is not necessary to convert
it to bool with the help of a double logical negation (!!).
Remove the redundant operation.
Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
Link: https://patch.msgid.link/20250424060339.73119-1-zhangzihuan@kylinos.cn
[ rjw: Changelog rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Other messages are occasionally printed between these two, for example:
[203104.106534] Restarting tasks ...
[203104.106559] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915])
[203104.112354] done.
This seems to be a timing issue, seen in two of the eleven
hibernation exits in my current `dmesg` output.
When printed on its own, the "done" message has the default log level.
This makes the output of `dmesg --level=warn` quite misleading.
Add enough context for the "done" messages to make sense on their own,
and use the same log level for all messages.
Change the messages to "<event>..." / "Done <event>.", unlike a449dfbfc089
which uses "<event>..." / "<event> completed.". Front-loading the unique
part of the message makes it easier to scan the log, and reduces ambiguity
for users who aren't confident in their English comprehension.
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Andrew Sayers <kernel.org@pileofstuff.org>
Link: https://patch.msgid.link/20250411152632.2806038-1-kernel.org@pileofstuff.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The size parameter is optional and strscpy() automatically determines
the length of the destination buffer using sizeof() if the argument is
omitted. This makes the explicit sizeof() calls unnecessary. Remove
them to shorten and simplify the code.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20250318080755.61126-2-thorsten.blum@linux.dev
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Remove legacy compression interface
- Improve scatterwalk API
- Add request chaining to ahash and acomp
- Add virtual address support to ahash and acomp
- Add folio support to acomp
- Remove NULL dst support from acomp
Algorithms:
- Library options are fuly hidden (selected by kernel users only)
- Add Kerberos5 algorithms
- Add VAES-based ctr(aes) on x86
- Ensure LZO respects output buffer length on compression
- Remove obsolete SIMD fallback code path from arm/ghash-ce
Drivers:
- Add support for PCI device 0x1134 in ccp
- Add support for rk3588's standalone TRNG in rockchip
- Add Inside Secure SafeXcel EIP-93 crypto engine support in eip93
- Fix bugs in tegra uncovered by multi-threaded self-test
- Fix corner cases in hisilicon/sec2
Others:
- Add SG_MITER_LOCAL to sg miter
- Convert ubifs, hibernate and xfrm_ipcomp from legacy API to acomp"
* tag 'v6.15-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (187 commits)
crypto: testmgr - Add multibuffer acomp testing
crypto: acomp - Fix synchronous acomp chaining fallback
crypto: testmgr - Add multibuffer hash testing
crypto: hash - Fix synchronous ahash chaining fallback
crypto: arm/ghash-ce - Remove SIMD fallback code path
crypto: essiv - Replace memcpy() + NUL-termination with strscpy()
crypto: api - Call crypto_alg_put in crypto_unregister_alg
crypto: scompress - Fix incorrect stream freeing
crypto: lib/chacha - remove unused arch-specific init support
crypto: remove obsolete 'comp' compression API
crypto: compress_null - drop obsolete 'comp' implementation
crypto: cavium/zip - drop obsolete 'comp' implementation
crypto: zstd - drop obsolete 'comp' implementation
crypto: lzo - drop obsolete 'comp' implementation
crypto: lzo-rle - drop obsolete 'comp' implementation
crypto: lz4hc - drop obsolete 'comp' implementation
crypto: lz4 - drop obsolete 'comp' implementation
crypto: deflate - drop obsolete 'comp' implementation
crypto: 842 - drop obsolete 'comp' implementation
crypto: nx - Migrate to scomp API
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- New option "printk.debug_non_panic_cpus" allows to store printk
messages from non-panic CPUs during panic. It might be useful when
panic() fails. It is disabled by default because it increases the
chance to see the messages printed before panic() and on the
panic-CPU.
- New build option "CONFIG_NULL_TTY_DEFAULT_CONSOLE" allows to build
kernel without the virtual terminal support which prefers ttynull
over serial console.
- Do not unblank suspended consoles.
- Some code clean up.
* tag 'printk-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk/panic: Add option to allow non-panic CPUs to write to the ring buffer.
printk: Add an option to allow ttynull to be a default console device
printk: Check CON_SUSPEND when unblanking a console
printk: Rename console_start to console_resume
printk: Rename console_stop to console_suspend
printk: Rename resume_console to console_resume_all
printk: Rename suspend_console to console_suspend_all
|
|
Merge updates related to system sleep for 6.15-rc1 including fixes,
cleanups and a rework of the "smart suspend" driver flag handling to
avoid issues that may occur when drivers using it depend on some other
drivers:
- Rework the handling of the "smart suspend" driver flag in the PM core
to avoid issues hat may occur when drivers using it depend on some
other drivers and clean up the related PM core code (Rafael Wysocki,
Colin Ian King).
- Fix the handling of devices with the power.direct_complete flag set
if device_suspend() returns an error for at least one device to avoid
situations in which some of them may not be resumed (Rafael Wysocki).
- Use mutex_trylock() in hibernate_compressor_param_set() to avoid a
possible deadlock that may occur if the "compressor" hibernation
module parameter is accessed during the registration of a new
ieee80211 device (Lizhi Xu).
- Suppress sleeping parent warning in device_pm_add() in the case when
new children are added under a device with the power.direct_complete
set after it has been processed by device_resume() (Xu Yang).
- Remove needless return in three void functions related to system
wakeup (Zijun Hu).
- Replace deprecated kmap_atomic() with kmap_local_page() in the
hibernation core code (David Reaver).
- Remove unused helper functions related to system sleep (David Alan
Gilbert).
- Clean up s2idle_enter() so it does not lock and unlock CPU offline
in vain and update comments in it (Ulf Hansson).
- Clean up broken white space in dpm_wait_for_children() (Geert
Uytterhoeven).
* pm-sleep:
PM: sleep: Fix bit masking operation
PM: sleep: Fix handling devices with direct_complete set on errors
PM: sleep: core: Fix indentation in dpm_wait_for_children()
PM: s2idle: Extend comment in s2idle_enter()
PM: s2idle: Drop redundant locks when entering s2idle
PM: sleep: Remove unused pm_generic_ wrappers
PM: sleep: Rearrange dpm_async_fn() and async state clearing
PM: sleep: Rename power.async_in_progress to power.work_in_progress
PM: core: Tweak pm_runtime_block_if_disabled() return value
PM: runtime: Convert pm_runtime_blocked() to static inline
PM: sleep: Update power.smart_suspend under PM spinlock
PM: sleep: Adjust check before setting power.must_resume
PM: wakeup: Remove needless return in three void APIs
PM: sleep: Suppress sleeping parent warning in special case
PM: hibernate: Avoid deadlock in hibernate_compressor_param_set()
PM: sleep: Avoid unnecessary checks in device_prepare_smart_suspend()
PM: sleep: Use DPM_FLAG_SMART_SUSPEND conditionally
PM: runtime: Introduce pm_runtime_blocked()
PM: Block enabling of runtime PM during system suspend
PM: hibernate: Replace deprecated kmap_atomic() with kmap_local_page()
|
|
Replace the legacy crypto compression interface with the new acomp
interface.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
The s2idle_lock must be held while checking for a pending wakeup and while
moving into S2IDLE_STATE_ENTER, to make sure a wakeup doesn't get lost.
Let's extend the comment in the code to make this clear.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://patch.msgid.link/20250311160827.1129643-3-ulf.hansson@linaro.org
[ rjw: Rewrote the new comment ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The calls to cpus_read_lock|unlock() protects us from getting CPUS
hotplugged, while entering suspend-to-idle. However, when s2idle_enter() is
called we should be far beyond the point when CPUs may be hotplugged.
Let's therefore simplify the code and drop the use of the lock.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://patch.msgid.link/20250311160827.1129643-2-ulf.hansson@linaro.org
[ rjw: Rewrote the new comment ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The function resume_console has a misleading name, since it resumes all
consoles, so rename it accordingly.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20250226-printk-renaming-v1-2-0b878577f2e6@suse.com
[pmladek@suse.com: Fixed typo in the commit message.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
The function suspend_console has a misleading name, since it suspends all
consoles, so rename it accordingly.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20250226-printk-renaming-v1-1-0b878577f2e6@suse.com
[pmladek@suse.com: Fixed typo in the commit message.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Now not only CPUs can use energy efficiency models, but GPUs
can also use. On the other hand, even with only one CPU, we can also
use energy_model to align control in thermal.
So remove the dependence of SMP, and add the DEVFREQ.
Signed-off-by: Jeson Gao <jeson.gao@unisoc.com>
[Added missing SMP config option in DTPM_CPU dependency]
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/20250307132649.4056210-1-lukasz.luba@arm.com
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The usage of __rcu in the Energy Model code is quite inconsistent
which causes the following sparse warnings to trigger:
kernel/power/energy_model.c:169:15: warning: incorrect type in assignment (different address spaces)
kernel/power/energy_model.c:169:15: expected struct em_perf_table [noderef] __rcu *table
kernel/power/energy_model.c:169:15: got struct em_perf_table *
kernel/power/energy_model.c:171:9: warning: incorrect type in argument 1 (different address spaces)
kernel/power/energy_model.c:171:9: expected struct callback_head *head
kernel/power/energy_model.c:171:9: got struct callback_head [noderef] __rcu *
kernel/power/energy_model.c:171:9: warning: cast removes address space '__rcu' of expression
kernel/power/energy_model.c:182:19: warning: incorrect type in argument 1 (different address spaces)
kernel/power/energy_model.c:182:19: expected struct kref *kref
kernel/power/energy_model.c:182:19: got struct kref [noderef] __rcu *
kernel/power/energy_model.c:200:15: warning: incorrect type in assignment (different address spaces)
kernel/power/energy_model.c:200:15: expected struct em_perf_table [noderef] __rcu *table
kernel/power/energy_model.c:200:15: got void *[assigned] _res
kernel/power/energy_model.c:204:20: warning: incorrect type in argument 1 (different address spaces)
kernel/power/energy_model.c:204:20: expected struct kref *kref
kernel/power/energy_model.c:204:20: got struct kref [noderef] __rcu *
kernel/power/energy_model.c:320:19: warning: incorrect type in argument 1 (different address spaces)
kernel/power/energy_model.c:320:19: expected struct kref *kref
kernel/power/energy_model.c:320:19: got struct kref [noderef] __rcu *
kernel/power/energy_model.c:325:45: warning: incorrect type in argument 2 (different address spaces)
kernel/power/energy_model.c:325:45: expected struct em_perf_state *table
kernel/power/energy_model.c:325:45: got struct em_perf_state [noderef] __rcu *
kernel/power/energy_model.c:425:45: warning: incorrect type in argument 3 (different address spaces)
kernel/power/energy_model.c:425:45: expected struct em_perf_state *table
kernel/power/energy_model.c:425:45: got struct em_perf_state [noderef] __rcu *
kernel/power/energy_model.c:442:15: warning: incorrect type in argument 1 (different address spaces)
kernel/power/energy_model.c:442:15: expected void const *objp
kernel/power/energy_model.c:442:15: got struct em_perf_table [noderef] __rcu *[assigned] em_table
kernel/power/energy_model.c:626:55: warning: incorrect type in argument 2 (different address spaces)
kernel/power/energy_model.c:626:55: expected struct em_perf_state *table
kernel/power/energy_model.c:626:55: got struct em_perf_state [noderef] __rcu *
kernel/power/energy_model.c:681:16: warning: incorrect type in assignment (different address spaces)
kernel/power/energy_model.c:681:16: expected struct em_perf_state *new_ps
kernel/power/energy_model.c:681:16: got struct em_perf_state [noderef] __rcu *
kernel/power/energy_model.c:699:37: warning: incorrect type in argument 2 (different address spaces)
kernel/power/energy_model.c:699:37: expected struct em_perf_state *table
kernel/power/energy_model.c:699:37: got struct em_perf_state [noderef] __rcu *
kernel/power/energy_model.c:733:38: warning: incorrect type in argument 3 (different address spaces)
kernel/power/energy_model.c:733:38: expected struct em_perf_state *table
kernel/power/energy_model.c:733:38: got struct em_perf_state [noderef] __rcu *
kernel/power/energy_model.c:855:53: warning: dereference of noderef expression
kernel/power/energy_model.c:864:32: warning: dereference of noderef expression
This is because the __rcu annotation for sparse is only applicable to
pointers that need rcu_dereference() or equivalent for protection, which
basically means pointers assigned with rcu_assign_pointer().
Make all of the above sparse warnings go away by cleaning up the usage
of __rcu and using rcu_dereference_protected() where applicable.
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/5885405.DvuYhMxLoT@rjwysocki.net
|
|
Notice that em_dev_register_perf_domain() and the functions called by it
do not update objects pointed to by its cb and cpus parameters, so the
const modifier can be added to them.
This allows the return value of cpumask_of() or a pointer to a
struct em_data_callback declared as const to be passed to
em_dev_register_perf_domain() directly without explicit type
casting which is rather handy.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/4648962.LvFx2qVVIh@rjwysocki.net
|
|
syzbot reported a deadlock in lock_system_sleep() (see below).
The write operation to "/sys/module/hibernate/parameters/compressor"
conflicts with the registration of ieee80211 device, resulting in a deadlock
when attempting to acquire system_transition_mutex under param_lock.
To avoid this deadlock, change hibernate_compressor_param_set() to use
mutex_trylock() for attempting to acquire system_transition_mutex and
return -EBUSY when it fails.
Task flags need not be saved or adjusted before calling
mutex_trylock(&system_transition_mutex) because the caller is not going
to end up waiting for this mutex and if it runs concurrently with system
suspend in progress, it will be frozen properly when it returns to user
space.
syzbot report:
syz-executor895/5833 is trying to acquire lock:
ffffffff8e0828c8 (system_transition_mutex){+.+.}-{4:4}, at: lock_system_sleep+0x87/0xa0 kernel/power/main.c:56
but task is already holding lock:
ffffffff8e07dc68 (param_lock){+.+.}-{4:4}, at: kernel_param_lock kernel/params.c:607 [inline]
ffffffff8e07dc68 (param_lock){+.+.}-{4:4}, at: param_attr_store+0xe6/0x300 kernel/params.c:586
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (param_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:585 [inline]
__mutex_lock+0x19b/0xb10 kernel/locking/mutex.c:730
ieee80211_rate_control_ops_get net/mac80211/rate.c:220 [inline]
rate_control_alloc net/mac80211/rate.c:266 [inline]
ieee80211_init_rate_ctrl_alg+0x18d/0x6b0 net/mac80211/rate.c:1015
ieee80211_register_hw+0x20cd/0x4060 net/mac80211/main.c:1531
mac80211_hwsim_new_radio+0x304e/0x54e0 drivers/net/wireless/virtual/mac80211_hwsim.c:5558
init_mac80211_hwsim+0x432/0x8c0 drivers/net/wireless/virtual/mac80211_hwsim.c:6910
do_one_initcall+0x128/0x700 init/main.c:1257
do_initcall_level init/main.c:1319 [inline]
do_initcalls init/main.c:1335 [inline]
do_basic_setup init/main.c:1354 [inline]
kernel_init_freeable+0x5c7/0x900 init/main.c:1568
kernel_init+0x1c/0x2b0 init/main.c:1457
ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:148
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
-> #2 (rtnl_mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:585 [inline]
__mutex_lock+0x19b/0xb10 kernel/locking/mutex.c:730
wg_pm_notification drivers/net/wireguard/device.c:80 [inline]
wg_pm_notification+0x49/0x180 drivers/net/wireguard/device.c:64
notifier_call_chain+0xb7/0x410 kernel/notifier.c:85
notifier_call_chain_robust kernel/notifier.c:120 [inline]
blocking_notifier_call_chain_robust kernel/notifier.c:345 [inline]
blocking_notifier_call_chain_robust+0xc9/0x170 kernel/notifier.c:333
pm_notifier_call_chain_robust+0x27/0x60 kernel/power/main.c:102
snapshot_open+0x189/0x2b0 kernel/power/user.c:77
misc_open+0x35a/0x420 drivers/char/misc.c:179
chrdev_open+0x237/0x6a0 fs/char_dev.c:414
do_dentry_open+0x735/0x1c40 fs/open.c:956
vfs_open+0x82/0x3f0 fs/open.c:1086
do_open fs/namei.c:3830 [inline]
path_openat+0x1e88/0x2d80 fs/namei.c:3989
do_filp_open+0x20c/0x470 fs/namei.c:4016
do_sys_openat2+0x17a/0x1e0 fs/open.c:1428
do_sys_open fs/open.c:1443 [inline]
__do_sys_openat fs/open.c:1459 [inline]
__se_sys_openat fs/open.c:1454 [inline]
__x64_sys_openat+0x175/0x210 fs/open.c:1454
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 ((pm_chain_head).rwsem){++++}-{4:4}:
down_read+0x9a/0x330 kernel/locking/rwsem.c:1524
blocking_notifier_call_chain_robust kernel/notifier.c:344 [inline]
blocking_notifier_call_chain_robust+0xa9/0x170 kernel/notifier.c:333
pm_notifier_call_chain_robust+0x27/0x60 kernel/power/main.c:102
snapshot_open+0x189/0x2b0 kernel/power/user.c:77
misc_open+0x35a/0x420 drivers/char/misc.c:179
chrdev_open+0x237/0x6a0 fs/char_dev.c:414
do_dentry_open+0x735/0x1c40 fs/open.c:956
vfs_open+0x82/0x3f0 fs/open.c:1086
do_open fs/namei.c:3830 [inline]
path_openat+0x1e88/0x2d80 fs/namei.c:3989
do_filp_open+0x20c/0x470 fs/namei.c:4016
do_sys_openat2+0x17a/0x1e0 fs/open.c:1428
do_sys_open fs/open.c:1443 [inline]
__do_sys_openat fs/open.c:1459 [inline]
__se_sys_openat fs/open.c:1454 [inline]
__x64_sys_openat+0x175/0x210 fs/open.c:1454
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #0 (system_transition_mutex){+.+.}-{4:4}:
check_prev_add kernel/locking/lockdep.c:3163 [inline]
check_prevs_add kernel/locking/lockdep.c:3282 [inline]
validate_chain kernel/locking/lockdep.c:3906 [inline]
__lock_acquire+0x249e/0x3c40 kernel/locking/lockdep.c:5228
lock_acquire.part.0+0x11b/0x380 kernel/locking/lockdep.c:5851
__mutex_lock_common kernel/locking/mutex.c:585 [inline]
__mutex_lock+0x19b/0xb10 kernel/locking/mutex.c:730
lock_system_sleep+0x87/0xa0 kernel/power/main.c:56
hibernate_compressor_param_set+0x1c/0x210 kernel/power/hibernate.c:1452
param_attr_store+0x18f/0x300 kernel/params.c:588
module_attr_store+0x55/0x80 kernel/params.c:924
sysfs_kf_write+0x117/0x170 fs/sysfs/file.c:139
kernfs_fop_write_iter+0x33d/0x500 fs/kernfs/file.c:334
new_sync_write fs/read_write.c:586 [inline]
vfs_write+0x5ae/0x1150 fs/read_write.c:679
ksys_write+0x12b/0x250 fs/read_write.c:731
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
system_transition_mutex --> rtnl_mutex --> param_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(param_lock);
lock(rtnl_mutex);
lock(param_lock);
lock(system_transition_mutex);
*** DEADLOCK ***
Reported-by: syzbot+ace60642828c074eb913@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ace60642828c074eb913
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Link: https://patch.msgid.link/20250224013139.3994500-1-lizhi.xu@windriver.com
[ rjw: New subject matching the code changes, changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The callback function of call_rcu() just calls kfree(), so use
kfree_rcu() instead of call_rcu() + callback function.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/20250218082021.2766-1-lirongqing@baidu.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Every iteration of the loop over all possible CPUs in
em_check_capacity_update() causes get_cpu_device() to be called twice
for the same CPU, once indirectly via em_cpu_get() and once directly.
Get rid of the indirect get_cpu_device() call by moving the direct
invocation of it earlier and using em_pd_get() instead of em_cpu_get()
to get a pd pointer for the dev one returned by it.
This also exposes the fact that dev is needed to get a pd, so the code
becomes somewhat easier to follow after it.
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/1925950.tdWV9SEqCh@rjwysocki.net
|
|
The max_cap parameter is never used in em_adjust_new_capacity(), so
drop it.
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/2369979.ElGaqSPkdT@rjwysocki.net
|
|
kmap_atomic() is deprecated and should be replaced with kmap_local_page()
[1][2]. kmap_local_page() is faster in kernels with HIGHMEM enabled, can
take page faults, and allows preemption.
According to [2], this replacement is safe as long as the code between
kmap_atomic() and kunmap_atomic() does not implicitly depend on disabling
page faults or preemption. In all of the call sites in this patch, the only
thing happening between mapping and unmapping pages is copy_page() calls,
and I don't suspect they depend on disabling page faults or preemption.
Link: https://lwn.net/Articles/836144/ [1]
Link: https://docs.kernel.org/mm/highmem.html#temporary-virtual-mappings [2]
Signed-off-by: David Reaver <me@davidreaver.com>
Link: https://patch.msgid.link/20250112152658.20132-1-me@davidreaver.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These are mostly fixes on top of the previously merged power
management material with the addition of some teo cpuidle governor
updates, some of which may also be regarded as fixes:
- Add missing error handling for syscore_suspend() to the hibernation
core code (Wentao Liang)
- Revert a commit that added unused macros (Andy Shevchenko)
- Synchronize the runtime PM status of devices that were runtime-
suspended before a system-wide suspend and need to be resumed
during the subsequent system-wide resume transition (Rafael
Wysocki)
- Clean up the teo cpuidle governor and make the handling of short
idle intervals in it consistent regardless of the properties of
idle states supplied by the cpuidle driver (Rafael Wysocki)
- Fix some boost-related issues in cpufreq (Lifeng Zheng)
- Fix build issues in the s3c64xx and airoha cpufreq drivers (Viresh
Kumar)
- Remove unconditional binding of schedutil governor kthreads to the
affected CPUs if the cpufreq driver indicates that updates can
happen from any CPU (Christian Loehle)"
* tag 'pm-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: core: Synchronize runtime PM status of parents and children
cpufreq: airoha: Depends on OF
PM: Revert "Add EXPORT macros for exporting PM functions"
PM: hibernate: Add error handling for syscore_suspend()
cpufreq/schedutil: Only bind threads if needed
cpufreq: ACPI: Remove set_boost in acpi_cpufreq_cpu_init()
cpufreq: CPPC: Fix wrong max_freq in policy initialization
cpufreq: Introduce a more generic way to set default per-policy boost flag
cpufreq: Fix re-boost issue after hotplugging a CPU
cpufreq: s3c64xx: Fix compilation warning
cpuidle: teo: Skip sleep length computation for low latency constraints
cpuidle: teo: Replace time_span_ns with a flag
cpuidle: teo: Simplify handling of total events count
cpuidle: teo: Skip getting the sleep length if wakeups are very frequent
cpuidle: teo: Simplify counting events used for tick management
cpuidle: teo: Clarify two code comments
cpuidle: teo: Drop local variable prev_intercept_idx
cpuidle: teo: Combine candidate state index checks against 0
cpuidle: teo: Reorder candidate state index checks
cpuidle: teo: Rearrange idle state lookup code
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes
the page allocator so we end up with the ability to allocate and
free zero-refcount pages. So that callers (ie, slab) can avoid a
refcount inc & dec
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
use large folios other than PMD-sized ones
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
and fixes for this small built-in kernel selftest
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part
of the mapletree code
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups
- "simplify split calculation" from Wei Yang provides a few fixes and
a test for the mapletree code
- "mm/vma: make more mmap logic userland testable" from Lorenzo
Stoakes continues the work of moving vma-related code into the
(relatively) new mm/vma.c
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the
page allocator
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue.
It should reduce the amount of unnecessary reading
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated:
https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
Qi's series addresses this windup by synchronously freeing PTE
memory within the context of madvise(MADV_DONTNEED)
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests
code when optional compiler warnings are enabled
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from
David Hildenbrand tightens the allocator's observance of
__GFP_HARDWALL
- "pkeys kselftests improvements" from Kevin Brodsky implements
various fixes and cleanups in the MM selftests code, mainly
pertaining to the pkeys tests
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a
tmpfs-based kernel build was demonstrated
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleaups for zram in the area of
zram_write_page(). A watchdog softlockup warning was eliminated
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
Brodsky cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging
logic
- "Account page tables at all levels" from Kevin Brodsky cleans up
and regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy
- "mm/damon: replace most damon_callback usages in sysfs with new
core functions" from SeongJae Park cleans up and generalizes
DAMON's sysfs file interface logic
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is
presented in response to DAMOS actions
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park
removes DAMON's long-deprecated debugfs interfaces. Thus the
migration to sysfs is completed
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
Peter Xu cleans up and generalizes the hugetlb reservation
accounting
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting),
but also inclusion (allowing) behavior
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to
reduce the size of struct page and to enable dynamic allocation of
memory descriptors
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes
and simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload. As was a 35% reduction for kernel
build time with swap-on-zram
- "mm: update mips to use do_mmap(), make mmap_region() internal"
from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few
MGLRU regressions and otherwise improves MGLRU performance
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
Park updates DAMON documentation
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing
- "mm: hugetlb+THP folio and migration cleanups" from David
Hildenbrand provides various cleanups in the areas of hugetlb
folios, THP folios and migration
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for
pagecache reading and writing. To permite userspace to address
issues with massive buildup of useless pagecache when
reading/writing fast devices
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests"
* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm/compaction: fix UBSAN shift-out-of-bounds warning
s390/mm: add missing ctor/dtor on page table upgrade
kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
tools: add VM_WARN_ON_VMG definition
mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
seqlock: add missing parameter documentation for raw_seqcount_try_begin()
mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
mm/page_alloc: remove the incorrect and misleading comment
zram: remove zcomp_stream_put() from write_incompressible_page()
mm: separate move/undo parts from migrate_pages_batch()
mm/kfence: use str_write_read() helper in get_access_type()
selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
selftests/mm: vm_util: split up /proc/self/smaps parsing
selftests/mm: virtual_address_range: unmap chunks after validation
selftests/mm: virtual_address_range: mmap() without PROT_WRITE
selftests/memfd/memfd_test: fix possible NULL pointer dereference
mm: add FGP_DONTCACHE folio creation flag
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
...
|
|
Before SLUB initialization, various subsystems used memblock_alloc to
allocate memory. In most cases, when memory allocation fails, an
immediate panic is required. To simplify this behavior and reduce
repetitive checks, introduce `memblock_alloc_or_panic`. This function
ensures that memory allocation failures result in a panic automatically,
improving code readability and consistency across subsystems that require
this behavior.
[guoweikang.kernel@gmail.com: arch/s390: save_area_alloc default failure behavior changed to panic]
Link: https://lkml.kernel.org/r/20250109033136.2845676-1-guoweikang.kernel@gmail.com
Link: https://lore.kernel.org/lkml/Z2fknmnNtiZbCc7x@kernel.org/
Link: https://lkml.kernel.org/r/20250102072528.650926-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390]
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In hibernation_platform_enter(), the code did not check the
return value of syscore_suspend(), potentially leading to a
situation where syscore_resume() would be called even if
syscore_suspend() failed. This could cause unpredictable
behavior or system instability.
Modify the code sequence in question to properly handle errors returned
by syscore_suspend(). If an error occurs in the suspend path, the code
now jumps to label 'Enable_irqs' skipping the syscore_resume() call and
only enabling interrupts after setting the system state to SYSTEM_RUNNING.
Fixes: 40dc166cb5dd ("PM / Core: Introduce struct syscore_ops for core subsystems PM")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Link: https://patch.msgid.link/20250119143205.2103-1-vulab@iscas.ac.cn
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Merge updates related to system sleep, a cpuidle update and an Energy
Model handling code update for 6.14-rc1:
- Allow configuring the system suspend-resume (DPM) watchdog to warn
earlier than panic (Douglas Anderson).
- Implement devm_device_init_wakeup() helper and introduce a device-
managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan).
- Remove direct inclusions of 'pm_wakeup.h' which should be only
included via 'device.h' (Wolfram Sang).
- Clean up two comments in the core system-wide PM code (Rafael
Wysocki, Randy Dunlap).
- Add Clearwater Forest processor support to the intel_idle cpuidle
driver (Artem Bityutskiy).
- Move sched domains rebuild function from the schedutil cpufreq
governor to the Energy Model handling code (Rafael Wysocki).
* pm-sleep:
PM: sleep: wakeirq: Introduce device-managed variant of dev_pm_set_wake_irq()
PM: sleep: Allow configuring the DPM watchdog to warn earlier than panic
PM: sleep: convert comment from kernel-doc to plain comment
PM: wakeup: implement devm_device_init_wakeup() helper
PM: sleep: sysfs: don't include 'pm_wakeup.h' directly
PM: sleep: autosleep: don't include 'pm_wakeup.h' directly
PM: sleep: Update stale comment in device_resume()
* pm-cpuidle:
intel_idle: add Clearwater Forest SoC support
* pm-em:
PM: EM: Move sched domains rebuild function from schedutil to EM
|
|
Allow configuring the DPM watchdog to warn about slow suspend/resume
functions without causing a system panic(). This allows you to set the
DPM_WATCHDOG_WARNING_TIMEOUT to something like 5 or 10 seconds to get
warnings about slow suspend/resume functions that eventually succeed.
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Tomasz Figa <tfiga@chromium.org>
Link: https://patch.msgid.link/20250109125957.v2.1.I4554f931b8da97948f308ecc651b124338ee9603@changeid
[ rjw: Subject edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Modify a non-kernel-doc comment to begin with /* instead of /**
so that it does not cause a kernel-doc warning.
power.h:114: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Auxiliary structure used for reading the snapshot image data and
power.h:114: warning: missing initial short description on line:
* Auxiliary structure used for reading the snapshot image data and
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Link: https://patch.msgid.link/20250111063107.910825-1-rdunlap@infradead.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Function sugov_eas_rebuild_sd() defined in the schedutil cpufreq governor
implements generic functionality that may be useful in other places. In
particular, there is a plan to use it in the intel_pstate driver in the
future.
For this reason, move it from schedutil to the energy model code and
rename it to em_rebuild_sched_domains().
This also helps to get rid of some #ifdeffery in schedutil which is a
plus.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
|
|
The header clearly states that it does not want to be included directly,
only via 'device.h'. 'platform_device.h' works equally well. Remove the
direct inclusion.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Link: https://patch.msgid.link/20241118072917.3853-16-wsa+renesas@sang-engineering.com
[ rjw: Subject edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Pull kvm updates from Paolo Bonzini:
"The biggest change here is eliminating the awful idea that KVM had of
essentially guessing which pfns are refcounted pages.
The reason to do so was that KVM needs to map both non-refcounted
pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP
VMAs that contain refcounted pages.
However, the result was security issues in the past, and more recently
the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by
struct page but is not refcounted. In particular this broke virtio-gpu
blob resources (which directly map host graphics buffers into the
guest as "vram" for the virtio-gpu device) with the amdgpu driver,
because amdgpu allocates non-compound higher order pages and the tail
pages could not be mapped into KVM.
This requires adjusting all uses of struct page in the
per-architecture code, to always work on the pfn whenever possible.
The large series that did this, from David Stevens and Sean
Christopherson, also cleaned up substantially the set of functions
that provided arch code with the pfn for a host virtual addresses.
The previous maze of twisty little passages, all different, is
replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the
non-__ versions of these two, and kvm_prefetch_pages) saving almost
200 lines of code.
ARM:
- Support for stage-1 permission indirection (FEAT_S1PIE) and
permission overlays (FEAT_S1POE), including nested virt + the
emulated page table walker
- Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This
call was introduced in PSCIv1.3 as a mechanism to request
hibernation, similar to the S4 state in ACPI
- Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
part of it, introduce trivial initialization of the host's MPAM
context so KVM can use the corresponding traps
- PMU support under nested virtualization, honoring the guest
hypervisor's trap configuration and event filtering when running a
nested guest
- Fixes to vgic ITS serialization where stale device/interrupt table
entries are not zeroed when the mapping is invalidated by the VM
- Avoid emulated MMIO completion if userspace has requested
synchronous external abort injection
- Various fixes and cleanups affecting pKVM, vCPU initialization, and
selftests
LoongArch:
- Add iocsr and mmio bus simulation in kernel.
- Add in-kernel interrupt controller emulation.
- Add support for virtualization extensions to the eiointc irqchip.
PPC:
- Drop lingering and utterly obsolete references to PPC970 KVM, which
was removed 10 years ago.
- Fix incorrect documentation references to non-existing ioctls
RISC-V:
- Accelerate KVM RISC-V when running as a guest
- Perf support to collect KVM guest statistics from host side
s390:
- New selftests: more ucontrol selftests and CPU model sanity checks
- Support for the gen17 CPU model
- List registers supported by KVM_GET/SET_ONE_REG in the
documentation
x86:
- Cleanup KVM's handling of Accessed and Dirty bits to dedup code,
improve documentation, harden against unexpected changes.
Even if the hardware A/D tracking is disabled, it is possible to
use the hardware-defined A/D bits to track if a PFN is Accessed
and/or Dirty, and that removes a lot of special cases.
- Elide TLB flushes when aging secondary PTEs, as has been done in
x86's primary MMU for over 10 years.
- Recover huge pages in-place in the TDP MMU when dirty page logging
is toggled off, instead of zapping them and waiting until the page
is re-accessed to create a huge mapping. This reduces vCPU jitter.
- Batch TLB flushes when dirty page logging is toggled off. This
reduces the time it takes to disable dirty logging by ~3x.
- Remove the shrinker that was (poorly) attempting to reclaim shadow
page tables in low-memory situations.
- Clean up and optimize KVM's handling of writes to
MSR_IA32_APICBASE.
- Advertise CPUIDs for new instructions in Clearwater Forest
- Quirk KVM's misguided behavior of initialized certain feature MSRs
to their maximum supported feature set, which can result in KVM
creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to
a non-zero value results in the vCPU having invalid state if
userspace hides PDCM from the guest, which in turn can lead to
save/restore failures.
- Fix KVM's handling of non-canonical checks for vCPUs that support
LA57 to better follow the "architecture", in quotes because the
actual behavior is poorly documented. E.g. most MSR writes and
descriptor table loads ignore CR4.LA57 and operate purely on
whether the CPU supports LA57.
- Bypass the register cache when querying CPL from kvm_sched_out(),
as filling the cache from IRQ context is generally unsafe; harden
the cache accessors to try to prevent similar issues from occuring
in the future. The issue that triggered this change was already
fixed in 6.12, but was still kinda latent.
- Advertise AMD_IBPB_RET to userspace, and fix a related bug where
KVM over-advertises SPEC_CTRL when trying to support cross-vendor
VMs.
- Minor cleanups
- Switch hugepage recovery thread to use vhost_task.
These kthreads can consume significant amounts of CPU time on
behalf of a VM or in response to how the VM behaves (for example
how it accesses its memory); therefore KVM tried to place the
thread in the VM's cgroups and charge the CPU time consumed by that
work to the VM's container.
However the kthreads did not process SIGSTOP/SIGCONT, and therefore
cgroups which had KVM instances inside could not complete freezing.
Fix this by replacing the kthread with a PF_USER_WORKER thread, via
the vhost_task abstraction. Another 100+ lines removed, with
generally better behavior too like having these threads properly
parented in the process tree.
- Revert a workaround for an old CPU erratum (Nehalem/Westmere) that
didn't really work; there was really nothing to work around anyway:
the broken patch was meant to fix nested virtualization, but the
PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the
erratum.
- Fix 6.12 regression where CONFIG_KVM will be built as a module even
if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is
'y'.
x86 selftests:
- x86 selftests can now use AVX.
Documentation:
- Use rST internal links
- Reorganize the introduction to the API document
Generic:
- Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock
instead of RCU, so that running a vCPU on a different task doesn't
encounter long due to having to wait for all CPUs become quiescent.
In general both reads and writes are rare, but userspace that
supports confidential computing is introducing the use of "helper"
vCPUs that may jump from one host processor to another. Those will
be very happy to trigger a synchronize_rcu(), and the effect on
performance is quite the disaster"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits)
KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD
KVM: x86: add back X86_LOCAL_APIC dependency
Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()"
KVM: x86: switch hugepage recovery thread to vhost_task
KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR
x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest
Documentation: KVM: fix malformed table
irqchip/loongson-eiointc: Add virt extension support
LoongArch: KVM: Add irqfd support
LoongArch: KVM: Add PCHPIC user mode read and write functions
LoongArch: KVM: Add PCHPIC read and write functions
LoongArch: KVM: Add PCHPIC device support
LoongArch: KVM: Add EIOINTC user mode read and write functions
LoongArch: KVM: Add EIOINTC read and write functions
LoongArch: KVM: Add EIOINTC device support
LoongArch: KVM: Add IPI user mode read and write function
LoongArch: KVM: Add IPI read and write function
LoongArch: KVM: Add IPI device support
LoongArch: KVM: Add iocsr and mmio bus simulation in kernel
KVM: arm64: Pass on SVE mapping failures
...
|
|
On some devices there are HW dependencies for shared frequency and voltage
between devices. It will impact Energy Aware Scheduler (EAS) decision,
where CPUs share the voltage & frequency domain with other CPUs or devices
e.g.
- Mid CPUs + Big CPU
- Little CPU + L3 cache in DSU
- some other device + Little CPUs
Detailed explanation of one example:
When the L3 cache frequency is increased, the affected Little CPUs might
run at higher voltage and frequency. That higher voltage causes higher CPU
power and thus more energy is used for running the tasks. This is
important for background running tasks, which try to run on energy
efficient CPUs.
Therefore, add performance state limits which are applied for the device
(in this case CPU). This is important on SoCs with HW dependencies
mentioned above so that the Energy Aware Scheduler (EAS) does not use
performance states outside the valid min-max range for energy calculation.
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/20241030164126.1263793-2-lukasz.luba@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The PSCI v1.3 specification adds support for a SYSTEM_OFF2 function
which is analogous to ACPI S4 state. This will allow hosting
environments to determine that a guest is hibernated rather than just
powered off, and handle that state appropriately on subsequent launches.
Since commit 60c0d45a7f7a ("efi/arm64: use UEFI for system reset and
poweroff") the EFI shutdown method is deliberately preferred over PSCI
or other methods. So register a SYS_OFF_MODE_POWER_OFF handler which
*only* handles the hibernation, leaving the original PSCI SYSTEM_OFF as
a last resort via the legacy pm_power_off function pointer.
The hibernation code already exports a system_entering_hibernation()
function which is be used by the higher-priority handler to check for
hibernation. That existing function just returns the value of a static
boolean variable from hibernate.c, which was previously only set in the
hibernation_platform_enter() code path. Set the same flag in the simpler
code path around the call to kernel_power_off() too.
An alternative way to hook SYSTEM_OFF2 into the hibernation code would
be to register a platform_hibernation_ops structure with an ->enter()
method which makes the new SYSTEM_OFF2 call. But that would have the
unwanted side-effect of making hibernation take a completely different
code path in hibernation_platform_enter(), invoking a lot of special dpm
callbacks.
Another option might be to add a new SYS_OFF_MODE_HIBERNATE mode, with
fallback to SYS_OFF_MODE_POWER_OFF. Or to use the sys_off_data to
indicate whether the power off is for hibernation.
But this version works and is relatively simple.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20241019172459.2241939-7-dwmw2@infradead.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
no_llseek had been defined to NULL two years ago, in commit 868941b14441
("fs: remove no_llseek")
To quote that commit,
At -rc1 we'll need do a mechanical removal of no_llseek -
git grep -l -w no_llseek | grep -v porting.rst | while read i; do
sed -i '/\<no_llseek\>/d' $i
done
would do it.
Unfortunately, that hadn't been done. Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
.llseek = no_llseek,
so it's obviously safe.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When saveable_highmem_page() is unused, it prevents kernel builds
with clang, `make W=1` and CONFIG_WERROR=y:
kernel/power/snapshot.c:1369:21: error: unused function 'saveable_highmem_page' [-Werror,-Wunused-function]
1369 | static inline void *saveable_highmem_page(struct zone *z, unsigned long p)
| ^~~~~~~~~~~~~~~~~~~~~
Fix this by removing unused stub.
See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static
inline functions for W=1 build").
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20240905184848.318978-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
As Documentation/filesystems/sysfs.rst suggested, show() should only use
sysfs_emit() or sysfs_emit_at() when formatting the value to be returned
to user space.
No functional change intended.
Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn>
Link: https://patch.msgid.link/20240801083156.2513508-3-luoxueqin@kylinos.cn
[ rjw: Subject edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|