summaryrefslogtreecommitdiff
path: root/kernel/cgroup
AgeCommit message (Collapse)Author
14 daysMerge tag 'mm-stable-2025-05-31-14-50' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architecture must implement to provide this. - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invokation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic. - "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleaups and a small efficiency improvement to this part of our swap handling code. - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscalls arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value. This series should have been incorporated into mm.git's "non-MM" branch, but I goofed. - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions. - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error. - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities. - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM. - "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecesary work, but this is fixed up where it is anticipated to occur. - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures. - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years. - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation. - "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramaticelly reduced. - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code. - ""Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit. - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering. - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection. - "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible. because presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently. - "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in in the huge page splitting and migrating code. - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization. - "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios. - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved. - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range. By skipping ranges of invalid pfns. - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned a single NUMA mode. Dramatic performance benefits were seen in some real world cases. - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS. - "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c. - "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function. - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that. - "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest. - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging. - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement. - "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement. - "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code. * tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits) mm: pcp: increase pcp->free_count threshold to trigger free_high mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page mm/hugetlb: pass folio instead of page to unmap_ref_private() memcg: objcg stock trylock without irq disabling memcg: no stock lock for cpu hot-unplug memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs memcg: make count_memcg_events re-entrant safe against irqs memcg: make mod_memcg_state re-entrant safe against irqs memcg: move preempt disable to callers of memcg_rstat_updated memcg: memcg_rstat_updated re-entrant safe against irqs mm: khugepaged: decouple SHMEM and file folios' collapse selftests/eventfd: correct test name and improve messages alloc_tag: check mem_profiling_support in alloc_tag_init Docs/damon: update titles and brief introductions to explain DAMOS selftests/damon/_damon_sysfs: read tried regions directories in order mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs ...
2025-05-29Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "As far as x86 goes this pull request "only" includes TDX host support. Quotes are appropriate because (at 6k lines and 100+ commits) it is much bigger than the rest, which will come later this week and consists mostly of bugfixes and selftests. s390 changes will also come in the second batch. ARM: - Add large stage-2 mapping (THP) support for non-protected guests when pKVM is enabled, clawing back some performance. - Enable nested virtualisation support on systems that support it, though it is disabled by default. - Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and protected modes. - Large rework of the way KVM tracks architecture features and links them with the effects of control bits. While this has no functional impact, it ensures correctness of emulation (the data is automatically extracted from the published JSON files), and helps dealing with the evolution of the architecture. - Significant changes to the way pKVM tracks ownership of pages, avoiding page table walks by storing the state in the hypervisor's vmemmap. This in turn enables the THP support described above. - New selftest checking the pKVM ownership transition rules - Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests even if the host didn't have it. - Fixes for the address translation emulation, which happened to be rather buggy in some specific contexts. - Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N from the number of counters exposed to a guest and addressing a number of issues in the process. - Add a new selftest for the SVE host state being corrupted by a guest. - Keep HCR_EL2.xMO set at all times for systems running with the kernel at EL2, ensuring that the window for interrupts is slightly bigger, and avoiding a pretty bad erratum on the AmpereOne HW. - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers from a pretty bad case of TLB corruption unless accesses to HCR_EL2 are heavily synchronised. - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS tables in a human-friendly fashion. - and the usual random cleanups. LoongArch: - Don't flush tlb if the host supports hardware page table walks. - Add KVM selftests support. RISC-V: - Add vector registers to get-reg-list selftest - VCPU reset related improvements - Remove scounteren initialization from VCPU reset - Support VCPU reset from userspace using set_mpstate() ioctl x86: - Initial support for TDX in KVM. This finally makes it possible to use the TDX module to run confidential guests on Intel processors. This is quite a large series, including support for private page tables (managed by the TDX module and mirrored in KVM for efficiency), forwarding some TDVMCALLs to userspace, and handling several special VM exits from the TDX module. This has been in the works for literally years and it's not really possible to describe everything here, so I'll defer to the various merge commits up to and including commit 7bcf7246c42a ('Merge branch 'kvm-tdx-finish-initial' into HEAD')" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (248 commits) x86/tdx: mark tdh_vp_enter() as __flatten Documentation: virt/kvm: remove unreferenced footnote RISC-V: KVM: lock the correct mp_state during reset KVM: arm64: Fix documentation for vgic_its_iter_next() KVM: arm64: np-guest CMOs with PMD_SIZE fixmap KVM: arm64: Stage-2 huge mappings for np-guests KVM: arm64: Add a range to pkvm_mappings KVM: arm64: Convert pkvm_mappings to interval tree KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest() KVM: arm64: Add a range to __pkvm_host_wrprotect_guest() KVM: arm64: Add a range to __pkvm_host_unshare_guest() KVM: arm64: Add a range to __pkvm_host_share_guest() KVM: arm64: Introduce for_each_hyp_page KVM: arm64: Handle huge mappings for np-guest CMOs KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET RISC-V: KVM: Remove scounteren initialization KVM: RISC-V: remove unnecessary SBI reset state ...
2025-05-27Merge tag 'cgroup-for-6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - cgroup rstat shared the tracking tree across all controllers with the rationale being that a cgroup which is using one resource is likely to be using other resources at the same time (ie. if something is allocating memory, it's probably consuming CPU cycles). However, this turned out to not scale very well especially with memcg using rstat for internal operations which made memcg stat read and flush patterns substantially different from other controllers. JP Kobryn split the rstat tree per controller. - cgroup BPF support was hooking into cgroup init/exit paths directly. Convert them to use a notifier chain instead so that other usages can be added easily. The two of the patches which implement this are mislabeled as belonging to sched_ext instead of cgroup. Sorry. - Relatively minor cpuset updates - Documentation updates * tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (23 commits) sched_ext: Convert cgroup BPF support to use cgroup_lifetime_notifier sched_ext: Introduce cgroup_lifetime_notifier cgroup: Minor reorganization of cgroup_create() cgroup, docs: cpu controller's interaction with various scheduling policies cgroup, docs: convert space indentation to tab indentation cgroup: avoid per-cpu allocation of size zero rstat cpu locks cgroup, docs: be specific about bandwidth control of rt processes cgroup: document the rstat per-cpu initialization cgroup: helper for checking rstat participation of css cgroup: use subsystem-specific rstat locks to avoid contention cgroup: use separate rstat trees for each subsystem cgroup: compare css to cgroup::self in helper for distingushing css cgroup: warn on rstat usage by early init subsystems cgroup/cpuset: drop useless cpumask_empty() in compute_effective_exclusive_cpumask() cgroup/rstat: Improve cgroup_rstat_push_children() documentation cgroup: fix goto ordering in cgroup_init() cgroup: fix pointer check in css_rstat_init() cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs cgroup/cpuset: Fix obsolete comment in cpuset_css_offline() cgroup/cpuset: Always use cpu_active_mask ...
2025-05-26Merge tag 'loongarch-kvm-6.16' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.16 1. Don't flush tlb if HW PTW supported. 2. Add LoongArch KVM selftests support.
2025-05-22sched_ext: Convert cgroup BPF support to use cgroup_lifetime_notifierTejun Heo
Replace explicit cgroup_bpf_inherit/offline() calls from cgroup creation/destruction paths with notification callback registered on cgroup_lifetime_notifier. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-22sched_ext: Introduce cgroup_lifetime_notifierTejun Heo
Other subsystems may make use of the cgroup hierarchy with the cgroup_bpf support being one such example. For such a feature, it's useful to be able to hook into cgroup creation and destruction paths to perform feature-specific initializations and cleanups. Add cgroup_lifetime_notifier which generates CGROUP_LIFETIME_ONLINE and CGROUP_LIFETIME_OFFLINE events whenever cgroups are created and destroyed, respectively. The next patch will convert cgroup_bpf to use the new notifier and other uses are planned. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-22cgroup: Minor reorganization of cgroup_create()Tejun Heo
cgroup_bpf init and exit handling will be moved to a notifier chain. In prepartion, reorganize cgroup_create() a bit so that the new cgroup is fully initialized before any outside changes are made. - cgrp->ancestors[] initialization and the hierarchical nr_descendants and nr_frozen_descendants updates were in the same loop. Separate them out and do the former earlier and do the latter later. - Relocate cgroup_bpf_inherit() call so that it's after all cgroup initializations are complete. No visible behavior changes expected. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-21cgroup: avoid per-cpu allocation of size zero rstat cpu locksJP Kobryn
Subsystem rstat locks are dynamically allocated per-cpu. It was discovered that a panic can occur during this allocation when the lock size is zero. This is the case on non-smp systems, since arch_spinlock_t is defined as an empty struct. Prevent this allocation when !CONFIG_SMP by adding a pre-processor conditional around the affected block. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Reported-by: Klara Modin <klarasmodin@gmail.com> Fixes: 748922dcfabd ("cgroup: use subsystem-specific rstat locks to avoid contention") Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-19cgroup: helper for checking rstat participation of cssJP Kobryn
There are a few places where a conditional check is performed to validate a given css on its rstat participation. This new helper tries to make the code more readable where this check is performed. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-19cgroup: use subsystem-specific rstat locks to avoid contentionJP Kobryn
It is possible to eliminate contention between subsystems when updating/flushing stats by using subsystem-specific locks. Let the existing rstat locks be dedicated to the cgroup base stats and rename them to reflect that. Add similar locks to the cgroup_subsys struct for use with individual subsystems. Lock initialization is done in the new function ss_rstat_init(ss) which replaces cgroup_rstat_boot(void). If NULL is passed to this function, the global base stat locks will be initialized. Otherwise, the subsystem locks will be initialized. Change the existing lock helper functions to accept a reference to a css. Then within these functions, conditionally select the appropriate locks based on the subsystem affiliation of the given css. Add helper functions for this selection routine to avoid repeated code. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-19cgroup: use separate rstat trees for each subsystemJP Kobryn
Different subsystems may call cgroup_rstat_updated() within the same cgroup, resulting in a tree of pending updates from multiple subsystems. When one of these subsystems is flushed via cgroup_rstat_flushed(), all other subsystems with pending updates on the tree will also be flushed. Change the paradigm of having a single rstat tree for all subsystems to having separate trees for each subsystem. This separation allows for subsystems to perform flushes without the side effects of other subsystems. As an example, flushing the cpu stats will no longer cause the memory stats to be flushed and vice versa. In order to achieve subsystem-specific trees, change the tree node type from cgroup to cgroup_subsys_state pointer. Then remove those pointers from the cgroup and instead place them on the css. Finally, change update/flush functions to make use of the different node type (css). These changes allow a specific subsystem to be associated with an update or flush. Separate rstat trees will now exist for each unique subsystem. Since updating/flushing will now be done at the subsystem level, there is no longer a need to keep track of updated css nodes at the cgroup level. The list management of these nodes done within the cgroup (rstat_css_list and related) has been removed accordingly. Conditional guards for checking validity of a given css were placed within css_rstat_updated/flush() to prevent undefined behavior occuring from kfunc usage in bpf programs. Guards were also placed within css_rstat_init/exit() in order to help consolidate calls to them. At call sites for all four functions, the existing guards were removed. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-19cgroup: compare css to cgroup::self in helper for distingushing cssJP Kobryn
Adjust the implementation of css_is_cgroup() so that it compares the given css to cgroup::self. Rename the function to css_is_self() in order to reflect that. Change the existing css->ss NULL check to a warning in the true branch. Finally, adjust call sites to use the new function name. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-19cgroup: warn on rstat usage by early init subsystemsJP Kobryn
An early init subsystem that attempts to make use of rstat can lead to failures during early boot. The reason for this is the timing in which the css's of the root cgroup have css_online() invoked on them. At the point of this call, there is a stated assumption that a cgroup has "successfully completed all allocations" [0]. An example of a subsystem that relies on the previously mentioned assumption [0] is the memory subsystem. Within its implementation of css_online(), work is queued to asynchronously begin flushing via rstat. In the early init path for a given subsystem, having rstat enabled leads to this sequence: cgroup_init_early() for_each_subsys(ss, ssid) if (ss->early_init) cgroup_init_subsys(ss, true) cgroup_init_subsys(ss, early_init) css = ss->css_alloc(...) init_and_link_css(css, ss, ...) ... online_css(css) online_css(css) ss = css->ss ss->css_online(css) Continuing to use the memory subsystem as an example, the issue with this sequence is that css_rstat_init() has not been called yet. This means there is now a race between the pending async work to flush rstat and the call to css_rstat_init(). So a flush can occur within the given cgroup while the rstat fields are not initialized. Since we are in the early init phase, the rstat fields cannot be initialized because they require per-cpu allocations. So it's not possible to have css_rstat_init() called early enough (before online_css()). This patch treats the combination of early init and rstat the same as as other invalid conditions. [0] Documentation/admin-guide/cgroup-v1/cgroups.rst (section: css_online) Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-12vmscan,cgroup: apply mems_effective to reclaimGregory Price
It is possible for a reclaimer to cause demotions of an lruvec belonging to a cgroup with cpuset.mems set to exclude some nodes. Attempt to apply this limitation based on the lruvec's memcg and prevent demotion. Notably, this may still allow demotion of shared libraries or any memory first instantiated in another cgroup. This means cpusets still cannot cannot guarantee complete isolation when demotion is enabled, and the docs have been updated to reflect this. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently - with the noted exceptions. Note on locking: The cgroup_get_e_css reference protects the css->effective_mems, and calls of this interface would be subject to the same race conditions associated with a non-atomic access to cs->effective_mems. So while this interface cannot make strong guarantees of correctness, it can therefore avoid taking a global or rcu_read_lock for performance. Link: https://lkml.kernel.org/r/20250424202806.52632-3-gourry@gourry.net Signed-off-by: Gregory Price <gourry@gourry.net> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Suggested-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Waiman Long <longman@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12cpuset: rename cpuset_node_allowed to cpuset_current_node_allowedGregory Price
Patch series "vmscan: enforce mems_effective during demotion", v5. Change reclaim to respect cpuset.mems_effective during demotion when possible. Presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. Implement cpuset_node_allowed() to check the cpuset.mems_effective associated wih the mem_cgroup of the lruvec being scanned. This only applies to cgroup/cpuset v2, as cpuset exists in a different hierarchy than mem_cgroup in v1. This requires renaming the existing cpuset_node_allowed() to be cpuset_current_now_allowed() - which is more descriptive anyway - to implement the new cpuset_node_allowed() which takes a target cgroup. This patch (of 2): Rename cpuset_node_allowed to reflect that the function checks the current task's cpuset.mems. This allows us to make a new cpuset_node_allowed function that checks a target cgroup's cpuset.mems. Link: https://lkml.kernel.org/r/20250424202806.52632-1-gourry@gourry.net Link: https://lkml.kernel.org/r/20250424202806.52632-2-gourry@gourry.net Signed-off-by: Gregory Price <gourry@gourry.net> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-09cgroup/cpuset: Extend kthread_is_per_cpu() check to all PF_NO_SETAFFINITY tasksWaiman Long
Commit ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset") enabled us to pull CPUs dedicated to child partitions from tasks in top_cpuset by ignoring per cpu kthreads. However, there can be other kthreads that are not per cpu but have PF_NO_SETAFFINITY flag set to indicate that we shouldn't mess with their CPU affinity. For other kthreads, their affinity will be changed to skip CPUs dedicated to child partitions whether it is an isolating or a scheduling one. As all the per cpu kthreads have PF_NO_SETAFFINITY set, the PF_NO_SETAFFINITY tasks are essentially a superset of per cpu kthreads. Fix this issue by dropping the kthread_is_per_cpu() check and checking the PF_NO_SETAFFINITY flag instead. Fixes: ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset") Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-09cgroup/cpuset: drop useless cpumask_empty() in ↵Yury Norov
compute_effective_exclusive_cpumask() Empty cpumasks can't intersect with any others. Therefore, testing for non-emptyness is useless. Signed-off-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-25cgroup/rstat: Improve cgroup_rstat_push_children() documentationWaiman Long
The cgroup_rstat_push_children() function converts a set of updated_children lists from different cgroups into a single ordered list of cgroups to be flushed via the rstat_flush_next pointer. The algorithm used isn't that well illustrated and it takes time to grasp what it is doing. Improve the embedded documentation and variable names to better illustrate the transformation process and make the code easier to understand. Also cgroup_rstat_lock must be held for the whole duration from where the rstat_flush_next list is being constructed in cgroup_rstat_push_children() to when it is consumed later in css_rstat_flush(). Otherwise, list corruption can happen leading to system crash as reported in [1]. In this particular case, the branch being used has commit 093c8812de2d ("cgroup: rstat: Cleanup flushing functions and locking") which breaks this rule, but is missing the fix commit 7d6c63c31914 ("cgroup: rstat: call cgroup_rstat_updated_list with cgroup_rstat_lock") that fixes it. This patch has no functional change. [1] https://lore.kernel.org/lkml/BY5PR04MB68495E9E8A46CA9614D62669BCBB2@BY5PR04MB6849.namprd04.prod.outlook.com/ Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-24cgroup: fix goto ordering in cgroup_init()JP Kobryn
Go to the appropriate section labels when css_rstat_init() or psi_cgroup_alloc() fails. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Fixes: a97915559f5c ("cgroup: change rstat function signatures from cgroup-based to css-based") Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-24Merge branch 'kvm-fixes-6.15-rc4' into HEADPaolo Bonzini
* Single fix for broken usage of 'multi-MIDR' infrastructure in PI code, adding an open-coded erratum check for Cavium ThunderX * Bugfixes from a planned posted interrupt rework * Do not use kvm_rip_read() unconditionally to cater for guests with inaccessible register state.
2025-04-21cgroup: fix pointer check in css_rstat_init()JP Kobryn
In css_rstat_init() allocations are done for the cgroup's pointers rstat_cpu and rstat_base_cpu. Make sure the allocation checks are consistent with what they are allocating. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-17cgroup/cpuset-v1: Add missing support for cpuset_v2_modeT.J. Mercier
Android has mounted the v1 cpuset controller using filesystem type "cpuset" (not "cgroup") since 2015 [1], and depends on the resulting behavior where the controller name is not added as a prefix for cgroupfs files. [2] Later, a problem was discovered where cpu hotplug onlining did not affect the cpuset/cpus files, which Android carried an out-of-tree patch to address for a while. An attempt was made to upstream this patch, but the recommendation was to use the "cpuset_v2_mode" mount option instead. [3] An effort was made to do so, but this fails with "cgroup: Unknown parameter 'cpuset_v2_mode'" because commit e1cba4b85daa ("cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup") did not update the special cased cpuset_mount(), and only the cgroup (v1) filesystem type was updated. Add parameter parsing to the cpuset filesystem type so that cpuset_v2_mode works like the cgroup filesystem type: $ mkdir /dev/cpuset $ mount -t cpuset -ocpuset_v2_mode none /dev/cpuset $ mount|grep cpuset none on /dev/cpuset type cgroup (rw,relatime,cpuset,noprefix,cpuset_v2_mode,release_agent=/sbin/cpuset_release_agent) [1] https://cs.android.com/android/_/android/platform/system/core/+/b769c8d24fd7be96f8968aa4c80b669525b930d3 [2] https://cs.android.com/android/platform/superproject/main/+/main:system/core/libprocessgroup/setup/cgroup_map_write.cpp;drc=2dac5d89a0f024a2d0cc46a80ba4ee13472f1681;l=192 [3] https://lore.kernel.org/lkml/f795f8be-a184-408a-0b5a-553d26061385@redhat.com/T/ Fixes: e1cba4b85daa ("cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup") Signed-off-by: T.J. Mercier <tjmercier@google.com> Acked-by: Waiman Long <longman@redhat.com> Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-17cgroup: Fix compilation issue due to cgroup_mutex not being exportedgaoxu
When adding folio_memcg function call in the zram module for Android16-6.12, the following error occurs during compilation: ERROR: modpost: "cgroup_mutex" [../soc-repo/zram.ko] undefined! This error is caused by the indirect call to lockdep_is_held(&cgroup_mutex) within folio_memcg. The export setting for cgroup_mutex is controlled by the CONFIG_PROVE_RCU macro. If CONFIG_LOCKDEP is enabled while CONFIG_PROVE_RCU is not, this compilation error will occur. To resolve this issue, add a parallel macro CONFIG_LOCKDEP control to ensure cgroup_mutex is properly exported when needed. Signed-off-by: gao xu <gaoxu2@honor.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08Merge tag 'cgroup-for-6.15-rc1-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - A number of cpuset remote partition related fixes and cleanups along with selftest updates. - A change from this merge window made cgroup_rstat_updated_list() called outside cgroup_rstat_lock leading to list corruptions. Fix it by relocating the call inside the lock. * tag 'cgroup-for-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Fix race between newly created partition and dying one cgroup: rstat: call cgroup_rstat_updated_list with cgroup_rstat_lock selftest/cgroup: Add a remote partition transition test to test_cpuset_prs.sh selftest/cgroup: Clean up and restructure test_cpuset_prs.sh selftest/cgroup: Update test_cpuset_prs.sh to use | as effective CPUs and state separator cgroup/cpuset: Remove unneeded goto in sched_partition_write() and rename it cgroup/cpuset: Code cleanup and comment update cgroup/cpuset: Don't allow creation of local partition over a remote one cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition cgroup/cpuset: Fix error handling in remote_partition_disable() cgroup/cpuset: Fix incorrect isolated_cpus update in update_parent_effective_cpumask()
2025-04-07cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUsWaiman Long
Add WARN_ON_ONCE() statements whenever new exclusive CPUs are being added to a partition root to catch inconsistency in the way exclusive CPUs are being handled in the cpuset code. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07cgroup/cpuset: Fix obsolete comment in cpuset_css_offline()Waiman Long
Move the partition disablement comment from cpuset_css_offline() to cpuset_css_killed() and reword it to remove the remnant of sched.partition. Suggested-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07cgroup/cpuset: Always use cpu_active_maskWaiman Long
The current cpuset code uses both cpu_active_mask and cpu_online_mask and it can be confusing which one should be used if we need to update the code. The top_cpuset is always synchronized to cpu_active_mask and we should avoid using cpu_online_mask as much as possible. An active CPU is always an online CPU, but not vice versa. cpu_active_mask and cpu_online_mask can differ during hotplug operations. A CPU is marked active at the last stage of CPU bringup (CPUHP_AP_ACTIVE). It is also the stage where cpuset hotplug code will be called to update the sched domains so that the scheduler can move a normal task to a newly active CPU or remove tasks away from a newly inactivated CPU. The online bit is set much earlier in the CPU bringup process and cleared much later in CPU teardown. If cpu_online_mask is used while a hotunplug operation is happening in parallel, we may leave an offline CPU in cpu_allowed or have a higher chance of leaving an offline CPU in some other masks. Avoid this problem by always using cpu_active_mask in the cpuset code and leave a comment as to why the use of cpu_online_mask is discouraged. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07Merge branch 'kvm-tdx-initial' into HEADPaolo Bonzini
This large commit contains the initial support for TDX in KVM. All x86 parts enable the host-side hypercalls that KVM uses to talk to the TDX module, a software component that runs in a special CPU mode called SEAM (Secure Arbitration Mode). The series is in turn split into multiple sub-series, each with a separate merge commit: - Initialization: basic setup for using the TDX module from KVM, plus ioctls to create TDX VMs and vCPUs. - MMU: in TDX, private and shared halves of the address space are mapped by different EPT roots, and the private half is managed by the TDX module. Using the support that was added to the generic MMU code in 6.14, add support for TDX's secure page tables to the Intel side of KVM. Generic KVM code takes care of maintaining a mirror of the secure page tables so that they can be queried efficiently, and ensuring that changes are applied to both the mirror and the secure EPT. - vCPU enter/exit: implement the callbacks that handle the entry of a TDX vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore of host state. - Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits" but are triggered through a different mechanism, similar to VMGEXIT for SEV-ES and SEV-SNP. - Interrupt handling: support for virtual interrupt injection as well as handling VM-Exits that are caused by vectored events. Exclusive to TDX are machine-check SMIs, which the kernel already knows how to handle through the kernel machine check handler (commit 7911f145de5f, "x86/mce: Implement recovery for errors in TDX/SEAM non-root mode") - Loose ends: handling of the remaining exits from the TDX module, including EPT violation/misconfig and several TDVMCALL leaves that are handled in the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning an error or ignoring operations that are not supported by TDX guests Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-05treewide: Switch/rename to timer_delete[_sync]()Thomas Gleixner
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-04cgroup: change rstat function signatures from cgroup-based to css-basedJP Kobryn
This non-functional change serves as preparation for moving to subsystem-based rstat trees. To simplify future commits, change the signatures of existing cgroup-based rstat functions to become css-based and rename them to reflect that. Though the signatures have changed, the implementations have not. Within these functions use the css->cgroup pointer to obtain the associated cgroup and allow code to function the same just as it did before this patch. At applicable call sites, pass the subsystem-specific css pointer as an argument or pass a pointer to cgroup::self if not in subsystem context. Note that cgroup_rstat_updated_list() and cgroup_rstat_push_children() are not altered yet since there would be a larger amount of css to cgroup conversions which may overcomplicate the code at this intermediate phase. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-04cgroup: add helper for checking when css is cgroup::selfJP Kobryn
The cgroup struct has a css field called "self". The main difference between this css and the others found in the cgroup::subsys array is that cgroup::self has a NULL subsystem pointer. There are several places where checks are performed to determine whether the css in question is cgroup::self or not. Instead of accessing css->ss directly, introduce a helper function that shows the intent and use where applicable. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-04cgroup: move rstat base stat objects into their own structJP Kobryn
This non-functional change serves as preparation for moving to subsystem-based rstat trees. The base stats are not an actual subsystem, but in future commits they will have exclusive rstat trees just as other subsystems will. Moving the base stat objects into a new struct allows the cgroup_rstat_cpu struct to become more compact since it now only contains the minimum amount of pointers needed for rstat participation. Subsystems will (in future commits) make use of the compact cgroup_rstat_cpu struct while avoiding the memory overhead of the base stat objects which they will not use. An instance of the new struct cgroup_rstat_base_cpu was placed on the cgroup struct so it can retain ownership of these base stats common to all cgroups. A helper function was added for looking up the cpu-specific base stats of a given cgroup. Finally, initialization and variable names were adjusted where applicable. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-01cgroup/cpuset: Fix race between newly created partition and dying oneWaiman Long
There is a possible race between removing a cgroup diectory that is a partition root and the creation of a new partition. The partition to be removed can be dying but still online, it doesn't not currently participate in checking for exclusive CPUs conflict, but the exclusive CPUs are still there in subpartitions_cpus and isolated_cpus. These two cpumasks are global states that affect the operation of cpuset partitions. The exclusive CPUs in dying cpusets will only be removed when cpuset_css_offline() function is called after an RCU delay. As a result, it is possible that a new partition can be created with exclusive CPUs that overlap with those of a dying one. When that dying partition is finally offlined, it removes those overlapping exclusive CPUs from subpartitions_cpus and maybe isolated_cpus resulting in an incorrect CPU configuration. This bug was found when a warning was triggered in remote_partition_disable() during testing because the subpartitions_cpus mask was empty. One possible way to fix this is to iterate the dying cpusets as well and avoid using the exclusive CPUs in those dying cpusets. However, this can still cause random partition creation failures or other anomalies due to racing. A better way to fix this race is to reset the partition state at the moment when a cpuset is being killed. Introduce a new css_killed() CSS function pointer and call it, if defined, before setting CSS_DYING flag in kill_css(). Also update the css_is_dying() helper to use the CSS_DYING flag introduced by commit 33c35aa48178 ("cgroup: Prevent kill_css() from being called more than once") for proper synchronization. Add a new cpuset_css_killed() function to reset the partition state of a valid partition root if it is being killed. Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-01Merge tag 'driver-core-6.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updatesk from Greg KH: "Here is the big set of driver core updates for 6.15-rc1. Lots of stuff happened this development cycle, including: - kernfs scaling changes to make it even faster thanks to rcu - bin_attribute constify work in many subsystems - faux bus minor tweaks for the rust bindings - rust binding updates for driver core, pci, and platform busses, making more functionaliy available to rust drivers. These are all due to people actually trying to use the bindings that were in 6.14. - make Rafael and Danilo full co-maintainers of the driver core codebase - other minor fixes and updates" * tag 'driver-core-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (52 commits) rust: platform: require Send for Driver trait implementers rust: pci: require Send for Driver trait implementers rust: platform: impl Send + Sync for platform::Device rust: pci: impl Send + Sync for pci::Device rust: platform: fix unrestricted &mut platform::Device rust: pci: fix unrestricted &mut pci::Device rust: device: implement device context marker rust: pci: use to_result() in enable_device_mem() MAINTAINERS: driver core: mark Rafael and Danilo as co-maintainers rust/kernel/faux: mark Registration methods inline driver core: faux: only create the device if probe() succeeds rust/faux: Add missing parent argument to Registration::new() rust/faux: Drop #[repr(transparent)] from faux::Registration rust: io: fix devres test with new io accessor functions rust: io: rename `io::Io` accessors kernfs: Move dput() outside of the RCU section. efi: rci2: mark bin_attribute as __ro_after_init rapidio: constify 'struct bin_attribute' firmware: qemu_fw_cfg: constify 'struct bin_attribute' powerpc/perf/hv-24x7: Constify 'struct bin_attribute' ...
2025-04-01cgroup: rstat: call cgroup_rstat_updated_list with cgroup_rstat_lockShakeel Butt
The commit 093c8812de2d ("cgroup: rstat: Cleanup flushing functions and locking") during cleanup accidentally changed the code to call cgroup_rstat_updated_list() without cgroup_rstat_lock which is required. Fix it. Fixes: 093c8812de2d ("cgroup: rstat: Cleanup flushing functions and locking") Reported-by: Jakub Kicinski <kuba@kernel.org> Reported-by: Breno Leitao <leitao@debian.org> Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Closes: https://lore.kernel.org/all/6564c3d6-9372-4352-9847-1eb3aea07ca4@linux.ibm.com/ Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-31cgroup/cpuset: Remove unneeded goto in sched_partition_write() and rename itWaiman Long
The goto statement in sched_partition_write() is not needed. Remove it and rename sched_partition_write()/sched_partition_show() to cpuset_partition_write()/cpuset_partition_show(). Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-31cgroup/cpuset: Code cleanup and comment updateWaiman Long
Rename partition_xcpus_newstate() to isolated_cpus_update(), update_partition_exclusive() to update_partition_exclusive_flag() and the new_xcpus_state variable to isolcpus_updated to make their meanings more explicit. Also add some comments to further clarify the code. No functional change is expected. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-31cgroup/cpuset: Don't allow creation of local partition over a remote oneWaiman Long
Currently, we don't allow the creation of a remote partition underneath another local or remote partition. However, it is currently possible to create a new local partition with an existing remote partition underneath it if top_cpuset is the parent. However, the current cpuset code does not set the effective exclusive CPUs correctly to account for those that are taken by the remote partition. Changing the code to properly account for those remote partition CPUs under all possible circumstances can be complex. It is much easier to not allow such a configuration which is not that useful. So forbid that by making sure that exclusive_cpus mask doesn't overlap with subpartitions_cpus and invalidate the partition if that happens. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-31cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() ↵Waiman Long
handle remote partition Currently, changes in exclusive CPUs are being handled in remote_partition_check() by disabling conflicting remote partitions. However, that may lead to results unexpected by the users. Fix this problem by removing remote_partition_check() and making update_cpumasks_hier() handle changes in descendant remote partitions properly. The compute_effective_exclusive_cpumask() function is enhanced to check the exclusive_cpus and effective_xcpus from siblings and excluded them in its effective exclusive CPUs computation and return a value to show if there is any sibling conflicts. This is somewhat like the cpu_exclusive flag check in validate_change(). This is the initial step to enable us to retire the use of cpu_exclusive flag in cgroup v2 in the future. One of the tests in the TEST_MATRIX of the test_cpuset_prs.sh script has to be updated due to changes in the way a child remote partition root is being handled (updated instead of invalidation) in update_cpumasks_hier(). Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-31cgroup/cpuset: Fix error handling in remote_partition_disable()Waiman Long
When remote_partition_disable() is called to disable a remote partition, it always sets the partition to an invalid partition state. It should only do so if an error code (prs_err) has been set. Correct that and add proper error code in places where remote_partition_disable() is called due to error. Fixes: 181c8e091aae ("cgroup/cpuset: Introduce remote partition") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-31cgroup/cpuset: Fix incorrect isolated_cpus update in ↵Waiman Long
update_parent_effective_cpumask() Before commit f0af1bfc27b5 ("cgroup/cpuset: Relax constraints to partition & cpus changes"), a cpuset partition cannot be enabled if not all the requested CPUs can be granted from the parent cpuset. After that commit, a cpuset partition can be created even if the requested exclusive CPUs contain CPUs not allowed its parent. The delmask containing exclusive CPUs to be removed from its parent wasn't adjusted accordingly. That is not a problem until the introduction of a new isolated_cpus mask in commit 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in isolated partitions") as the CPUs in the delmask may be added directly into isolated_cpus. As a result, isolated_cpus may incorrectly contain CPUs that are not isolated leading to incorrect data reporting. Fix this by adjusting the delmask to reflect the actual exclusive CPUs for the creation of the partition. Fixes: 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in isolated partitions") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-24Merge tag 'sched-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core & fair scheduler changes: - Cancel the slice protection of the idle entity (Zihan Zhou) - Reduce the default slice to avoid tasks getting an extra tick (Zihan Zhou) - Force propagating min_slice of cfs_rq when {en,de}queue tasks (Tianchen Ding) - Refactor can_migrate_task() to elimate looping (I Hsin Cheng) - Add unlikey branch hints to several system calls (Colin Ian King) - Optimize current_clr_polling() on certain architectures (Yujun Dong) Deadline scheduler: (Juri Lelli) - Remove redundant dl_clear_root_domain call - Move dl_rebuild_rd_accounting to cpuset.h Uclamp: - Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan) - Optimize sched_uclamp_used static key enabling (Xuewen Yan) Scheduler topology support: (Juri Lelli) - Ignore special tasks when rebuilding domains - Add wrappers for sched_domains_mutex - Generalize unique visiting of root domains - Rebuild root domain accounting after every update - Remove partition_and_rebuild_sched_domains - Stop exposing partition_sched_domains_locked RSEQ: (Michael Jeanson) - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y - Fix segfault on registration when rseq_cs is non-zero - selftests: Add rseq syscall errors test - selftests: Ensure the rseq ABI TLS is actually 1024 bytes Membarriers: - Fix redundant load of membarrier_state (Nysal Jan K.A.) Scheduler debugging: - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior) - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar) Fixes and cleanups: - Always save/restore x86 TSC sched_clock() on suspend/resume (Guilherme G. Piccoli) - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian Andrzej Siewior)" * tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling() sched/debug: Remove CONFIG_SCHED_DEBUG sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional sched/debug: Make 'const_debug' tunables unconditional __read_mostly sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE() rseq/selftests: Fix namespace collision with rseq UAPI header include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h sched/topology: Stop exposing partition_sched_domains_locked cgroup/cpuset: Remove partition_and_rebuild_sched_domains sched/topology: Remove redundant dl_clear_root_domain call sched/deadline: Rebuild root domain accounting after every update sched/deadline: Generalize unique visiting of root domains sched/topology: Wrappers for sched_domains_mutex sched/deadline: Ignore special tasks when rebuilding domains tracing: Use preempt_model_str() xtensa: Rely on generic printing of preemption model x86: Rely on generic printing of preemption model s390: Rely on generic printing of preemption model ...
2025-03-24Merge tag 'cgroup-for-6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Add deprecation info messages to cgroup1-only features - rstat updates including a bug fix and breaking up a critical section to reduce interrupt latency impact - Other misc and doc updates * tag 'cgroup-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: Cleanup flushing functions and locking cgroup/rstat: avoid disabling irqs for O(num_cpu) mm: Fix a build breakage in memcontrol-v1.c blk-cgroup: Simplify policy files registration cgroup: Update file naming comment cgroup: Add deprecation message to legacy freezer controller mm: Add transformation message for per-memcg swappiness RFC cgroup/cpuset-v1: Add deprecation messages to sched_relax_domain_level cgroup/cpuset-v1: Add deprecation messages to memory_migrate cgroup/cpuset-v1: Add deprecation messages to mem_exclusive and mem_hardwall cgroup: Print message when /proc/cgroups is read on v2-only system cgroup/blkio: Add deprecation messages to reset_stats cgroup/cpuset-v1: Add deprecation messages to memory_spread_page and memory_spread_slab cgroup/cpuset-v1: Add deprecation messages to sched_load_balance and memory_pressure_enabled cgroup, docs: Be explicit about independence of RT_GROUP_SCHED and non-cpu controllers cgroup/rstat: Fix forceidle time in cpu.stat cgroup/misc: Remove unused misc_cg_res_total_usage cgroup/cpuset: Move procfs cpuset attribute under cgroup-v1.c cgroup: update comment about dropping cgroup kn refs
2025-03-20cgroup: rstat: Cleanup flushing functions and lockingYosry Ahmed
Now that the rstat lock is being re-acquired on every CPU iteration in cgroup_rstat_flush_locked(), having the initially acquire the lock is unnecessary and unclear. Inline cgroup_rstat_flush_locked() into cgroup_rstat_flush() and move the lock/unlock calls to the beginning and ending of the loop body to make the critical section obvious. cgroup_rstat_flush_hold/release() do not make much sense with the lock being dropped and reacquired internally. Since it has no external callers, remove it and explicitly acquire the lock in cgroup_base_stat_cputime_show() instead. This leaves the code with a single flushing function, cgroup_rstat_flush(). Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-19cgroup/rstat: avoid disabling irqs for O(num_cpu)Eric Dumazet
cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while iterating all possible cpus. It only drops the lock if there is scheduler or spin lock contention. If neither, then interrupts can be disabled for a long time. On large machines this can disable interrupts for a long enough time to drop network packets. On 400+ CPU machines I've seen interrupt disabled for over 40 msec. Prevent rstat from disabling interrupts while processing all possible cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This approach was previously discussed in https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/, though this was in the context of an non-irq rstat spin lock. Benchmark this change with: 1) a single stat_reader process with 400 threads, each reading a test memcg's memory.stat repeatedly for 10 seconds. 2) 400 memory hog processes running in the test memcg and repeatedly charging memory until oom killed. Then they repeat charging and oom killing. v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds interrupts are disabled by rstat for 45341 usec: # => started at: _raw_spin_lock_irq # => ended at: cgroup_rstat_flush # # # _------=> CPU# # / _-----=> irqs-off/BH-disabled # | / _----=> need-resched # || / _---=> hardirq/softirq # ||| / _--=> preempt-depth # |||| / _-=> migrate-disable # ||||| / delay # cmd pid |||||| time | caller # \ / |||||| \ | / stat_rea-96532 52d.... 0us*: _raw_spin_lock_irq stat_rea-96532 52d.... 45342us : cgroup_rstat_flush stat_rea-96532 52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush stat_rea-96532 52d.... 45343us : <stack trace> => memcg1_stat_format => memory_stat_format => memory_stat_show => seq_read_iter => vfs_read => ksys_read => do_syscall_64 => entry_SYSCALL_64_after_hwframe With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the longest holder. The longest irqs-off holder has irqs disabled for 4142 usec, a huge reduction from previous 45341 usec rstat finding. Running stat_reader memory.stat reader for 10 seconds: - without memory hogs: 9.84M accesses => 12.7M accesses - with memory hogs: 9.46M accesses => 11.1M accesses The throughput of memory.stat access improves. The mode of memory.stat access latency after grouping by of 2 buckets: - without memory hogs: 64 usec => 16 usec - with memory hogs: 64 usec => 8 usec The memory.stat latency improves. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Greg Thelen <gthelen@google.com> Tested-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-17cgroup/cpuset: Remove partition_and_rebuild_sched_domainsJuri Lelli
partition_and_rebuild_sched_domains() and partition_sched_domains() are now equivalent. Remove the former as a nice clean up. Suggested-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Waiman Long <llong@redhat.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MR4ryNDJZDzsSG@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/deadline: Rebuild root domain accounting after every updateJuri Lelli
Rebuilding of root domains accounting information (total_bw) is currently broken on some cases, e.g. suspend/resume on aarch64. Problem is that the way we keep track of domain changes and try to add bandwidth back is convoluted and fragile. Fix it by simplify things by making sure bandwidth accounting is cleared and completely restored after root domains changes (after root domains are again stable). To be sure we always call dl_rebuild_rd_accounting while holding cpuset_mutex we also add cpuset_reset_sched_domains() wrapper. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Co-developed-by: Waiman Long <llong@redhat.com> Signed-off-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MRfeJKJUOyUSto@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/topology: Wrappers for sched_domains_mutexJuri Lelli
Create wrappers for sched_domains_mutex so that it can transparently be used on both CONFIG_SMP and !CONFIG_SMP, as some function will need to do. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MP5Oq9RB8jBs3y@jlelli-thinkpadt14gen4.remote.csb
2025-03-14KVM: TDX: Register TDX host key IDs to cgroup misc controllerZhiming Hu
TDX host key IDs (HKID) are limit resources in a machine, and the misc cgroup lets the machine owner track their usage and limits the possibility of abusing them outside the owner's control. The cgroup v2 miscellaneous subsystem was introduced to control the resource of AMD SEV & SEV-ES ASIDs. Likewise introduce HKIDs as a misc resource. Signed-off-by: Zhiming Hu <zhiming.hu@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-11blk-cgroup: Simplify policy files registrationMichal Koutný
Use one set of files when there is no difference between default and legacy files, similar to regular subsys files registration. No functional change. Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tejun Heo <tj@kernel.org>