path: root/kernel/cgroup/cpuset.c
Age | Commit message | Author
2025-01-08 | cgroup/cpuset: remove kernfs active break | Chen Ridong
A warning was found:

  WARNING: CPU: 10 PID: 3486953 at fs/kernfs/file.c:828
  CPU: 10 PID: 3486953 Comm: rmdir Kdump: loaded Tainted: G
  RIP: 0010:kernfs_should_drain_open_files+0x1a1/0x1b0
  RSP: 0018:ffff8881107ef9e0 EFLAGS: 00010202
  RAX: 0000000080000002 RBX: ffff888154738c00 RCX: dffffc0000000000
  RDX: 0000000000000007 RSI: 0000000000000004 RDI: ffff888154738c04
  RBP: ffff888154738c04 R08: ffffffffaf27fa15 R09: ffffed102a8e7180
  R10: ffff888154738c07 R11: 0000000000000000 R12: ffff888154738c08
  R13: ffff888750f8c000 R14: ffff888750f8c0e8 R15: ffff888154738ca0
  FS: 00007f84cd0be740(0000) GS:ffff8887ddc00000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000555f9fbe00c8 CR3: 0000000153eec001 CR4: 0000000000370ee0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   kernfs_drain+0x15e/0x2f0
   __kernfs_remove+0x165/0x300
   kernfs_remove_by_name_ns+0x7b/0xc0
   cgroup_rm_file+0x154/0x1c0
   cgroup_addrm_files+0x1c2/0x1f0
   css_clear_dir+0x77/0x110
   kill_css+0x4c/0x1b0
   cgroup_destroy_locked+0x194/0x380
   cgroup_rmdir+0x2a/0x140

It can be explained by:

  rmdir                             echo 1 > cpuset.cpus
                                    kernfs_fop_write_iter // active=0
  cgroup_rm_file
  kernfs_remove_by_name_ns          kernfs_get_active // active=1
  __kernfs_remove                   // active=0x80000002
  kernfs_drain                      cpuset_write_resmask
  wait_event
  //waiting (active == 0x80000001)
                                    kernfs_break_active_protection
                                    // active = 0x80000001
  // continue
                                    kernfs_unbreak_active_protection
                                    // active = 0x80000002
  ...
  kernfs_should_drain_open_files
  // warning occurs
                                    kernfs_put_active

This warning is caused by 'kernfs_break_active_protection' when it is writing to cpuset.cpus, and the cgroup is removed concurrently.
Commit 3a5a6d0c2b03 ("cpuset: don't nest cgroup_mutex inside get_online_cpus()") made cpuset_hotplug_workfn asynchronous. This change involves calling flush_work(), which can create a circular locking dependency across multiple processes that involves cgroup_mutex, potentially leading to a deadlock. To avoid the deadlock, commit 76bb5ab8f6e3 ("cpuset: break kernfs active protection in cpuset_write_resmask()") added 'kernfs_break_active_protection' in cpuset_write_resmask. This could lead to this warning. After commit 2125c0034c5d ("cgroup/cpuset: Make cpuset hotplug processing synchronous"), cpuset_write_resmask no longer needs to wait for the hotplug to finish, which means that concurrent hotplug and cpuset operations are no longer possible. Therefore, the deadlock doesn't exist anymore and it does not have to 'break active protection' now. To fix this warning, just remove the kernfs_break_active_protection operation in 'cpuset_write_resmask'. Fixes: bdb2fd7fc56e ("kernfs: Skip kernfs_drain_open_files() more aggressively") Fixes: 76bb5ab8f6e3 ("cpuset: break kernfs active protection in cpuset_write_resmask()") Reported-by: Ji Fa <jifa@huawei.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
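The counter values quoted in the interleaving above can be replayed with a small model. This is only a sketch: the class and method names are hypothetical stand-ins rather than kernel APIs, and the bias constant is an assumption read off the trace (the remover waits for active == 0x80000001). It shows why the remover's later check sees 0x80000002 when active protection is broken and restored, and why it sees the expected value when the writer simply holds its reference until it finishes.

```python
# Toy model of the kn->active counter values quoted in the commit message.
# Assumption: the "deactivated" bias is 0x80000001, inferred from the trace.
MASK = 0xFFFFFFFF
KN_DEACTIVATED_BIAS = 0x80000001


class ActiveCounter:
    """Hypothetical stand-in for the 32-bit kn->active counter."""

    def __init__(self):
        self.active = 0

    def _add(self, delta):
        self.active = (self.active + delta) & MASK

    def get_active(self):          # writer takes an active reference
        self._add(1)

    def put_active(self):          # writer drops its active reference
        self._add(-1)

    def deactivate(self):          # remover marks the node as deactivated
        self._add(KN_DEACTIVATED_BIAS)

    def break_protection(self):    # writer temporarily gives up its reference
        self._add(-1)

    def unbreak_protection(self):  # writer takes the reference back
        self._add(1)

    def drained(self):             # condition the remover waits for
        return self.active == KN_DEACTIVATED_BIAS


def replay(with_break):
    """Replay the interleaving; return (trace, warning_would_fire)."""
    kn = ActiveCounter()
    trace = []

    def step(name, op):
        op()
        trace.append((name, hex(kn.active)))

    step("kernfs_get_active", kn.get_active)
    step("__kernfs_remove", kn.deactivate)
    if with_break:
        step("kernfs_break_active_protection", kn.break_protection)
        assert kn.drained()        # remover's wait_event() may now continue
        step("kernfs_unbreak_active_protection", kn.unbreak_protection)
    else:
        assert not kn.drained()    # remover keeps waiting for the writer
        step("kernfs_put_active", kn.put_active)
    # The remover's later check expects the counter to equal the bias.
    return trace, not kn.drained()
```

Calling `replay(True)` models the old behaviour and `replay(False)` models the behaviour after the break is removed.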
2024-12-11 | cgroup/cpuset: Prevent leakage of isolated CPUs into sched domains | Waiman Long
Isolated CPUs are not allowed to be used in a non-isolated partition. The only exception is the top cpuset which is allowed to contain boot time isolated CPUs. Commit ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation problem") introduces a simplified scheme of including only partition roots in sched domain generation. However, it does not properly account for this exception case. This can result in leakage of isolated CPUs into a sched domain. Fix it by making sure that isolated CPUs are excluded from the top cpuset before generating sched domains. Also update the way the boot time isolated CPUs are handled in test_cpuset_prs.sh to make sure that those isolated CPUs are really isolated instead of just skipping them in the tests. Fixes: ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation problem") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-10 | cgroup/cpuset: Remove stale text | Costa Shulyupin
Task's cpuset pointer was removed by commit 8793d854edbc ("Task Control Groups: make cpusets a client of cgroups"). The paragraph "The task_lock() exception ...." was removed by commit 2df167a300d7 ("cgroups: update comments in cpuset.c"). Remove the stale text:

  We also require taking task_lock() when dereferencing a task's cpuset pointer. See "The task_lock() exception", at the end of this comment.

  Accessing a task's cpuset should be done in accordance with the guidelines for accessing subsystem state in kernel/cgroup.c

and reformat. Co-developed-by: Michal Koutný <mkoutny@suse.com> Co-developed-by: Waiman Long <longman@redhat.com> Signed-off-by: Costa Shulyupin <costa.shul@redhat.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-14 | cgroup/cpuset: Disable cpuset_cpumask_can_shrink() test if not load balancing | Waiman Long
Some recently proposed changes [1] in the deadline server code caused a test failure in test_cpuset_prs.sh when a change is being made to an isolated partition. This is due to failing the cpuset_cpumask_can_shrink() check for SCHED_DEADLINE tasks at validate_change(). This is actually a false positive as the failed test case involves an isolated partition with load balancing disabled. The deadline check is not meaningful in this case and the users should know what they are doing. Fix this by doing the cpuset_cpumask_can_shrink() check only when load balancing is enabled. Also change its arguments to use effective_cpus for the current cpuset and user_xcpus() as an approximation for the target effective_cpus, as the real effective_cpus hasn't been fully computed yet at this early stage. As the check isn't comprehensive, there may be false positives or negatives. We may have to revise the code to do a more thorough check in the future if this becomes a concern. [1] https://lore.kernel.org/lkml/82be06c1-6d6d-4651-86c9-bcc828cbcb80@redhat.com/T/#t Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-12 | cgroup/cpuset: Further optimize code if CONFIG_CPUSETS_V1 not set | Waiman Long
Currently the cpuset code uses cgroup_subsys_on_dfl() to check if we are running with cgroup v2. If CONFIG_CPUSETS_V1 isn't set, there is really no need to do this check and we can optimize out some of the unneeded v1 specific code paths. Introduce a new cpuset_v2() and use it to replace the cgroup_subsys_on_dfl() check to further optimize the code. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-12 | cgroup/cpuset: Enforce at most one rebuild_sched_domains_locked() call per operation | Waiman Long
Since commit ff0ce721ec21 ("cgroup/cpuset: Eliminate unncessary sched domains rebuilds in hotplug"), there is only one rebuild_sched_domains_locked() call per hotplug operation. However, writing to the various cpuset control files may still cause more than one rebuild_sched_domains_locked() call to happen in some cases. Juri had found that two rebuild_sched_domains_locked() calls in update_prstate(), one from update_cpumasks_hier() and another one from update_partition_sd_lb(), could cause a cpuset partition to be created with null total_bw for DL tasks. IOW, DL tasks may not be scheduled correctly in such a partition. A sample command sequence that can reproduce null total_bw is as follows.

  # echo Y >/sys/kernel/debug/sched/verbose
  # echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control
  # mkdir /sys/fs/cgroup/test
  # echo 0-7 > /sys/fs/cgroup/test/cpuset.cpus
  # echo 6-7 > /sys/fs/cgroup/test/cpuset.cpus.exclusive
  # echo root >/sys/fs/cgroup/test/cpuset.cpus.partition

Fix this double rebuild_sched_domains_locked() call problem by replacing existing calls with cpuset_force_rebuild() except the rebuild_sched_domains_cpuslocked() call at the end of cpuset_handle_hotplug(). Checking of the force_sd_rebuild flag is now done at the end of cpuset_write_resmask() and update_prstate() to determine if rebuild_sched_domains_locked() should be called or not. The cpuset v1 code can still call rebuild_sched_domains_locked() directly as double rebuild_sched_domains_locked() calls are not possible. Reported-by: Juri Lelli <juri.lelli@redhat.com> Closes: https://lore.kernel.org/lkml/ZyuUcJDPBln1BK1Y@jlelli-thinkpadt14gen4.remote.csb/ Signed-off-by: Waiman Long <longman@redhat.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
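The "request now, rebuild once at the end" scheme described above can be sketched as a simple dirty-flag pattern. This is a minimal illustration, not the kernel code: the class, its methods, and the two-step operation are hypothetical, with comments noting which kernel function each one loosely stands in for.

```python
# Minimal sketch of deferring an expensive rebuild to the end of an operation.
# All names below are illustrative stand-ins, not kernel APIs.


class SchedDomainState:
    def __init__(self):
        self.force_sd_rebuild = False
        self.rebuild_calls = 0

    def request_rebuild(self):
        # Stands in for cpuset_force_rebuild(): only records the request.
        self.force_sd_rebuild = True

    def rebuild_now(self):
        # Stands in for rebuild_sched_domains_locked(): the expensive step.
        self.rebuild_calls += 1
        self.force_sd_rebuild = False

    def finish_operation(self):
        # Checked once at the end of a control-file write.
        if self.force_sd_rebuild:
            self.rebuild_now()


def update_partition_state(state):
    """Hypothetical operation whose two internal steps both want a rebuild."""
    state.request_rebuild()   # e.g. after propagating cpumasks down the tree
    state.request_rebuild()   # e.g. after updating load-balance settings
    state.finish_operation()  # exactly one rebuild, on the final state
```

Because the internal steps only set the flag, the rebuild always runs against the final state and runs at most once per operation.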
2024-11-12 | cgroup/cpuset: Revert "Allow suppression of sched domain rebuild in update_cpumasks_hier()" | Waiman Long
Revert commit 3ae0b773211e ("cgroup/cpuset: Allow suppression of sched domain rebuild in update_cpumasks_hier()") to allow for an alternative way to suppress unnecessary rebuild_sched_domains_locked() calls in update_cpumasks_hier() and elsewhere in a following commit. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-30 | cgroup/cpuset: Fix spelling errors in file kernel/cgroup/cpuset.c | everestkc
Corrected the spelling errors reported by codespell as follows:
  temparary ==> temporary
  Proprogate ==> Propagate
  constrainted ==> constrained
Signed-off-by: Everest K.C. <everestkc@everestkc.com.np> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-04 | cgroup/cpuset: Move cpu.h include to cpuset-internal.h | Waiman Long
The newly created cpuset-v1.c file uses cpus_read_lock/unlock() functions which are defined in cpu.h but not included in cpuset-internal.h yet leading to compilation error under certain kernel configurations. Fix it by moving the cpu.h include from cpuset.c to cpuset-internal.h. While at it, sort the include files in alphabetic order. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202408311612.mQTuO946-lkp@intel.com/ Fixes: 047b83097448 ("cgroup/cpuset: move relax_domain_level to cpuset-v1.c") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 | cgroup/cpuset: guard cpuset-v1 code under CONFIG_CPUSETS_V1 | Chen Ridong
This patch introduces CONFIG_CPUSETS_V1 and guards cpuset-v1 code under CONFIG_CPUSETS_V1. The default value of CONFIG_CPUSETS_V1 is N, so that users who have adopted v2 don't have to 'pay' for cpuset v1. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 | cgroup/cpuset: rename functions shared between v1 and v2 | Chen Ridong
Some function names declared in cpuset-internal.h are generic. To avoid conflicting with other identifiers of the same name, rename these functions with a cpuset_/cpuset1_ prefix to make them unique to cpuset. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 | cgroup/cpuset: move v1 interfaces to cpuset-v1.c | Chen Ridong
Move the legacy cpuset controller interface files and corresponding code into cpuset-v1.c. 'update_flag', 'cpuset_write_resmask' and 'cpuset_common_seq_show' are also used for v1, so declare them in cpuset-internal.h. 'cpuset_write_s64', 'cpuset_read_s64' and 'fmeter_getrate' are only used in cpuset-v1.c now, so make them static. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 | cgroup/cpuset: move validate_change_legacy to cpuset-v1.c | Chen Ridong
The validate_change_legacy function is used for v1, so move it to cpuset-v1.c. The two macros 'cpuset_for_each_child' and 'cpuset_for_each_descendant_pre' are common to v1 and v2, so move them to cpuset-internal.h. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 | cgroup/cpuset: move legacy hotplug update to cpuset-v1.c | Chen Ridong
There are some differences in hotplug update between cpuset v1 and cpuset v2. Move the legacy code to cpuset-v1.c. 'update_tasks_cpumask' and 'update_tasks_nodemask' are both used in cpuset v1 and cpuset v2, so declare them in cpuset-internal.h. The change from the original code is the use of the callback_lock helpers to lock and unlock callback_lock. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 | cgroup/cpuset: add callback_lock helper | Chen Ridong
To modify cpuset, both cpuset_mutex and callback_lock are needed. Add helpers for cpuset-v1 to get callback_lock. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 | cgroup/cpuset: move memory_spread to cpuset-v1.c | Chen Ridong
'memory_spread' is only set in cpuset v1. Move the corresponding code into cpuset-v1.c. Currently, 'cpuset_update_task_spread_flags' and 'update_tasks_flags' are exposed to cpuset.c. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 | cgroup/cpuset: move relax_domain_level to cpuset-v1.c | Chen Ridong
Setting the domain level is not supported in cpuset v2, so move the corresponding code into cpuset-v1.c. 'cpuset_write_s64' and 'cpuset_read_s64' are only used for setting the domain level, so move them to cpuset-v1.c. For now they are exposed to cpuset.c; after the cpuset legacy interface files are moved to cpuset-v1.c, they can be static. 'rebuild_sched_domains_locked' is exposed to cpuset-v1.c. The change from the original code is the use of the 'cpuset_lock' and 'cpuset_unlock' functions to lock or unlock cpuset_mutex. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 | cgroup/cpuset: move memory_pressure to cpuset-v1.c | Chen Ridong
Collection of memory_pressure can be enabled by writing 1 to the cpuset file 'memory_pressure_enabled', which is only for cpuset-v1. Therefore, move the corresponding code to cpuset-v1.c. Currently, the 'fmeter_init' and 'fmeter_getrate' functions are called at cpuset.c, so expose them to cpuset.c. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 | cgroup/cpuset: move common code to cpuset-internal.h | Chen Ridong
Move some declarations that will be used for cpuset v1 and v2, including 'struct cpuset', 'cpuset_flagbits_t', 'cpuset_filetype_t', etc. No logical change. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 | cgroup/cpuset: Account for boot time isolated CPUs | Waiman Long
With the "isolcpus" boot command line parameter, we are able to create isolated CPUs at boot time. These isolated CPUs aren't fully accounted for in the cpuset code. For instance, the root cgroup's "cpuset.cpus.isolated" control file does not include the boot time isolated CPUs. Fix that by looking for pre-isolated CPUs at init time. The prstate_housekeeping_conflict() function does check the HK_TYPE_DOMAIN housekeeping cpumask to make sure that CPUs outside of it can only be used in isolated partition. Given the fact that we are going to make housekeeping cpumasks dynamic, the current check may not be right anymore. Save the boot time HK_TYPE_DOMAIN cpumask and check against it instead of the upcoming dynamic HK_TYPE_DOMAIN housekeeping cpumask. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-20 | cgroup/cpuset: remove use_parent_ecpus of cpuset | Chen Ridong
use_parent_ecpus is used to track whether the children are using the parent's effective_cpus. When a parent's effective_cpus is changed due to changes in a child partition's effective_xcpus, any child using the parent's effective_cpus must call update_cpumasks_hier. However, if a child is not a valid partition, it is sufficient to determine whether to call update_cpumasks_hier based on whether the child's effective_cpus is going to change. To make the code more succinct, it is suggested to remove use_parent_ecpus. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-20 | cgroup/cpuset: remove fetch_xcpus | Chen Ridong
Both fetch_xcpus and user_xcpus functions are used to retrieve the value of exclusive_cpus. If exclusive_cpus is not set, cpus_allowed is the implicit value used as exclusive in a local partition. I can not imagine a scenario where effective_xcpus is not empty when exclusive_cpus is empty. Therefore, I suggest removing the fetch_xcpus function. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-20 | cgroup/cpuset: Correct invalid remote parition prs | Chen Ridong
When enabling a remote partition, I found that:

  cd /sys/fs/cgroup/
  mkdir test
  mkdir test/test1
  echo +cpuset > cgroup.subtree_control
  echo +cpuset > test/cgroup.subtree_control
  echo 3 > test/test1/cpuset.cpus
  echo root > test/test1/cpuset.cpus.partition
  cat test/test1/cpuset.cpus.partition
  root invalid (Parent is not a partition root)

The parent of a remote partition does not need to be a partition root, so this message is misleading. The failure is due to the empty effective_xcpus. It would be better to prompt the message "invalid cpu list in cpuset.cpus.exclusive". Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05 | cgroup/cpuset: Check for partition roots with overlapping CPUs | Waiman Long
With the previous commit that eliminates the overlapping partition root corner cases in the hotplug code, the partition roots passed down to generate_sched_domains() should not have overlapping CPUs. Enable overlapping cpuset check for v2 and warn if that happens. This patch also has the benefit of increasing test coverage of the new Union-Find cpuset merging code to cgroup v2. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05 | Merge branch 'cgroup/for-6.11-fixes' into cgroup/for-6.12 | Tejun Heo
cgroup/for-6.12 is about to receive updates that are dependent on changes from both for-6.11-fixes and for-6.12. Pull in for-6.11-fixes. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05 | cgroup/cpuset: Eliminate unncessary sched domains rebuilds in hotplug | Waiman Long
It was found that some hotplug operations may cause multiple rebuild_sched_domains_locked() calls. Some of those intermediate calls may use cpuset states not in the final correct form, leading to incorrect sched domain settings. Fix this problem by using the existing force_rebuild flag to inhibit immediate rebuild_sched_domains_locked() calls if set and only doing one final call at the end. Also rename the force_rebuild flag to force_sd_rebuild to make its meaning more clear. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05 | cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set | Waiman Long
Commit e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") adds a user writable cpuset.cpus.exclusive file for setting exclusive CPUs to be used for the creation of partitions. Since then effective_xcpus depends on both the cpuset.cpus and cpuset.cpus.exclusive setting. If cpuset.cpus.exclusive is set, effective_xcpus will depend only on cpuset.cpus.exclusive. When it is not set, effective_xcpus will be set according to the cpuset.cpus value when the cpuset becomes a valid partition root. When cpuset.cpus is being cleared by the user, effective_xcpus should only be cleared when cpuset.cpus.exclusive is not set. However, that is not currently the case.

  # cd /sys/fs/cgroup/
  # mkdir test
  # echo +cpuset > cgroup.subtree_control
  # cd test
  # echo 3 > cpuset.cpus.exclusive
  # cat cpuset.cpus.exclusive.effective
  3
  # echo > cpuset.cpus
  # cat cpuset.cpus.exclusive.effective // was cleared

Fix it by clearing effective_xcpus only if cpuset.cpus.exclusive is not set. Fixes: e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") Cc: stable@vger.kernel.org # v6.7+ Reported-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
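The clearing rule stated above can be illustrated with plain sets. This is a sketch under assumptions: the `Cpuset` dataclass and the `clear_cpus()` helper are hypothetical simplifications, and the `fixed` switch exists only to contrast the behaviour before and after the fix.

```python
# Sketch of the rule: clearing cpuset.cpus only clears effective_xcpus when
# cpuset.cpus.exclusive is not set. Names are illustrative, not kernel code.
from dataclasses import dataclass, field


@dataclass
class Cpuset:
    cpus_allowed: set = field(default_factory=set)      # cpuset.cpus
    exclusive_cpus: set = field(default_factory=set)    # cpuset.cpus.exclusive
    effective_xcpus: set = field(default_factory=set)   # ...exclusive.effective


def clear_cpus(cs, fixed=True):
    """Model of the user writing an empty string to cpuset.cpus."""
    cs.cpus_allowed = set()
    if not fixed or not cs.exclusive_cpus:
        # Before the fix this branch was taken unconditionally.
        cs.effective_xcpus = set()
    return cs


# Mirrors the reproduction: CPU 3 is exclusive, then cpuset.cpus is cleared.
after_fix = clear_cpus(Cpuset(exclusive_cpus={3}, effective_xcpus={3}))
before_fix = clear_cpus(Cpuset(exclusive_cpus={3}, effective_xcpus={3}),
                        fixed=False)
```

With the fix modelled, `after_fix.effective_xcpus` keeps CPU 3, while `before_fix.effective_xcpus` ends up empty as in the reported reproduction.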
2024-08-05 | cgroup/cpuset: fix panic caused by partcmd_update | Chen Ridong
We found a bug as below:

  BUG: unable to handle page fault for address: 00000003
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 3 PID: 358 Comm: bash Tainted: G W I 6.6.0-10893-g60d6
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/4
  RIP: 0010:partition_sched_domains_locked+0x483/0x600
  Code: 01 48 85 d2 74 0d 48 83 05 29 3f f8 03 01 f3 48 0f bc c2 89 c0 48 9
  RSP: 0018:ffffc90000fdbc58 EFLAGS: 00000202
  RAX: 0000000100000003 RBX: ffff888100b3dfa0 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000002fe80
  RBP: ffff888100b3dfb0 R08: 0000000000000001 R09: 0000000000000000
  R10: ffffc90000fdbcb0 R11: 0000000000000004 R12: 0000000000000002
  R13: ffff888100a92b48 R14: 0000000000000000 R15: 0000000000000000
  FS: 00007f44a5425740(0000) GS:ffff888237d80000(0000) knlGS:0000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000100030973 CR3: 000000010722c000 CR4: 00000000000006e0
  Call Trace:
   <TASK>
   ? show_regs+0x8c/0xa0
   ? __die_body+0x23/0xa0
   ? __die+0x3a/0x50
   ? page_fault_oops+0x1d2/0x5c0
   ? partition_sched_domains_locked+0x483/0x600
   ? search_module_extables+0x2a/0xb0
   ? search_exception_tables+0x67/0x90
   ? kernelmode_fixup_or_oops+0x144/0x1b0
   ? __bad_area_nosemaphore+0x211/0x360
   ? up_read+0x3b/0x50
   ? bad_area_nosemaphore+0x1a/0x30
   ? exc_page_fault+0x890/0xd90
   ? __lock_acquire.constprop.0+0x24f/0x8d0
   ? __lock_acquire.constprop.0+0x24f/0x8d0
   ? asm_exc_page_fault+0x26/0x30
   ? partition_sched_domains_locked+0x483/0x600
   ? partition_sched_domains_locked+0xf0/0x600
   rebuild_sched_domains_locked+0x806/0xdc0
   update_partition_sd_lb+0x118/0x130
   cpuset_write_resmask+0xffc/0x1420
   cgroup_file_write+0xb2/0x290
   kernfs_fop_write_iter+0x194/0x290
   new_sync_write+0xeb/0x160
   vfs_write+0x16f/0x1d0
   ksys_write+0x81/0x180
   __x64_sys_write+0x21/0x30
   x64_sys_call+0x2f25/0x4630
   do_syscall_64+0x44/0xb0
   entry_SYSCALL_64_after_hwframe+0x78/0xe2
  RIP: 0033:0x7f44a553c887

It can be reproduced with the commands:

  cd /sys/fs/cgroup/
  mkdir test
  cd test/
  echo +cpuset > ../cgroup.subtree_control
  echo root > cpuset.cpus.partition
  cat /sys/fs/cgroup/cpuset.cpus.effective
  0-3
  echo 0-3 > cpuset.cpus // taking away all cpus from root

This issue is caused by the incorrect rebuilding of scheduling domains. In this scenario, test/cpuset.cpus.partition should be an invalid root and should not trigger the rebuilding of scheduling domains. When calling update_parent_effective_cpumask with partcmd_update, if newmask is not null, it should recheck whether newmask leaves any CPUs available for a parent/cs that has tasks. Fixes: 0c7f293efc87 ("cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-30 | cpuset: use Union-Find to optimize the merging of cpumasks | Xavier
The process of constructing scheduling domains involves multiple loops and repeated evaluations, leading to numerous redundant and ineffective assessments that impact code efficiency. Here, we use union-find to optimize the merging of cpumasks. By employing path compression and union by rank, we effectively reduce the number of lookups and merge comparisons. Signed-off-by: Xavier <xavier_qy@163.com> Signed-off-by: Tejun Heo <tj@kernel.org>
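As a rough illustration of the technique named above, the following sketch groups overlapping CPU sets using union-find with path compression and union by rank. It is not the kernel implementation: the function name is hypothetical, the masks are modelled as Python sets, and the pairwise overlap scan is kept deliberately simple.

```python
# Illustrative union-find over cpumasks modelled as Python sets.
# merge_overlapping() is a hypothetical helper, not a kernel function.


def merge_overlapping(cpumasks):
    """Merge every group of transitively overlapping CPU sets."""
    n = len(cpumasks)
    parent = list(range(n))
    rank = [0] * n

    def find(x):
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:          # path compression
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        if rank[ra] < rank[rb]:           # union by rank: attach the
            ra, rb = rb, ra               # shallower tree under the deeper
        parent[rb] = ra
        if rank[ra] == rank[rb]:
            rank[ra] += 1

    for i in range(n):
        for j in range(i + 1, n):
            if cpumasks[i] & cpumasks[j]:  # any shared CPU joins the two
                union(i, j)

    groups = {}
    for i, mask in enumerate(cpumasks):
        groups.setdefault(find(i), set()).update(mask)
    return sorted(sorted(g) for g in groups.values())
```

For example, `merge_overlapping([{0, 1}, {2, 3}, {1, 2}, {5}])` joins the first three sets into one group through their shared CPUs and leaves `{5}` on its own.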
2024-07-30 | cgroup/cpuset: add decrease attach_in_progress helpers | Chen Ridong
There are several functions to decrease attach_in_progress, and they will wake up cpuset_attach_wq when attach_in_progress is zero. So, add a helper to make it concise. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
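The "decrement, and wake waiters when the count reaches zero" helper described above can be sketched in user space as follows. This is an analogy only: `threading.Condition` stands in for the kernel wait queue, and the class and its method names are hypothetical.

```python
# User-space analogy of a decrement helper that wakes waiters at zero.
# threading.Condition plays the role of the wait queue; names are illustrative.
import threading


class AttachTracker:
    def __init__(self):
        self.attach_in_progress = 0
        self._cond = threading.Condition()

    def begin_attach(self):
        with self._cond:
            self.attach_in_progress += 1

    def dec_attach_in_progress(self):
        # The single helper every completion or cancellation path can call.
        with self._cond:
            self.attach_in_progress -= 1
            if self.attach_in_progress == 0:
                self._cond.notify_all()

    def wait_idle(self, timeout=None):
        # Returns True once no attach is in progress, False on timeout.
        with self._cond:
            return self._cond.wait_for(
                lambda: self.attach_in_progress == 0, timeout)
```

Centralising the decrement and the wake-up in one place means no caller can forget the wake-up when it drops the last reference.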
2024-07-30 | cgroup/cpuset: Remove cpuset_slab_spread_rotor | Xiu Jianfeng
Since the SLAB implementation was removed in v6.8, cpuset_slab_spread_rotor is no longer used and can be removed. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-30 | cgroup/cpuset: remove child_ecpus_count | Chen Ridong
The child_ecpus_count variable was previously used to update sibling cpumask when parent's effective_cpus is updated. However, it became obsolete after commit e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2"). It should be removed. tj: Restored {} for style consistency. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-14 | Merge branch 'for-6.10-fixes' into for-6.11 | Tejun Heo
2024-06-28 | cgroup/cpuset: Prevent UAF in proc_cpuset_show() | Chen Ridong
A UAF can happen when /proc/cpuset is read, as reported in [1]. This can be reproduced by the following methods:

  1. Add an mdelay(1000) before acquiring the cgroup_lock in the cgroup_path_ns function.
  2. $cat /proc/<pid>/cpuset repeatedly.
  3. $mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/
     $umount /sys/fs/cgroup/cpuset/ repeatedly.

The race that causes this bug can be shown as below:

  (umount)              |  (cat /proc/<pid>/cpuset)
  css_release           |  proc_cpuset_show
  css_release_work_fn   |  css = task_get_css(tsk, cpuset_cgrp_id);
  css_free_rwork_fn     |  cgroup_path_ns(css->cgroup, ...);
  cgroup_destroy_root   |  mutex_lock(&cgroup_mutex);
  rebind_subsystems     |
  cgroup_free_root      |
                        |  // cgrp was freed, UAF
                        |  cgroup_path_ns_locked(cgrp,..);

When the cpuset is initialized, the root node top_cpuset.css.cgrp will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount operation will allocate cgroup_root, and top_cpuset.css.cgrp will point to the allocated &cgroup_root.cgrp. When the umount operation is executed, top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp. The problem is that when rebinding to cgrp_dfl_root, there are cases where the cgroup_root allocated by setting up the root for cgroup v1 is cached. This could lead to a Use-After-Free (UAF) if it is subsequently freed. The descendant cgroups of cgroup v1 can only be freed after the css is released. However, the css of the root will never be released, yet the cgroup_root should be freed when it is unmounted. This means that obtaining a reference to the css of the root does not guarantee that css.cgrp->root will not be freed. Fix this problem by using rcu_read_lock in proc_cpuset_show(). As cgroup_root is kfree_rcu after commit d23b5c577715 ("cgroup: Make operations on the cgroup root_list RCU safe"), css->cgroup won't be freed during the critical section. To call cgroup_path_ns_locked, css_set_lock is needed, so it is safe to replace task_get_css with task_css.
[1] https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-19 | cgroup/cpuset: Make cpuset.cpus.exclusive independent of cpuset.cpus | Waiman Long
The "cpuset.cpus.exclusive.effective" value is currently limited to a subset of its "cpuset.cpus". This makes the exclusive CPUs distribution hierarchy subsumed within the larger "cpuset.cpus" hierarchy. We have to decide on what CPUs are used locally and what CPUs can be passed down as exclusive CPUs down the hierarchy and combine them into "cpuset.cpus". The advantage of the current scheme is to have only one hierarchy to worry about. However, it makes it harder to use as all the "cpuset.cpus" values have to be properly set along the way down to the designated remote partition root. It also makes it more cumbersome to find out what CPUs can be used locally. Make creation of remote partition simpler by breaking the dependency of "cpuset.cpus.exclusive" on "cpuset.cpus" and make them independent entities. Now we have two separate hierarchies - one for setting "cpuset.cpus.effective" and the other one for setting "cpuset.cpus.exclusive.effective". We may not need to set "cpuset.cpus" when we activate a partition root anymore. Also update Documentation/admin-guide/cgroup-v2.rst and cpuset.c comment to document this change. Suggested-by: Petr Malat <oss@malat.biz> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-19 | cgroup/cpuset: Delay setting of CS_CPU_EXCLUSIVE until valid partition | Waiman Long
The CS_CPU_EXCLUSIVE flag is currently set whenever cpuset.cpus.exclusive is set to make sure that the exclusivity test will be run to ensure its exclusiveness. At the same time, this flag can be changed whenever the partition root state is changed. For example, the CS_CPU_EXCLUSIVE flag will be reset whenever a partition root becomes invalid. This makes using CS_CPU_EXCLUSIVE to ensure exclusiveness a bit fragile. The current scheme also makes setting up a cpuset.cpus.exclusive hierarchy to enable remote partition harder as cpuset.cpus.exclusive cannot overlap with any cpuset.cpus of sibling cpusets if their cpuset.cpus.exclusive aren't set. Solve these issues by deferring the setting of the CS_CPU_EXCLUSIVE flag until the cpuset becomes a valid partition root while adding new checks in validate_change() to ensure that cpuset.cpus.exclusive of sibling cpusets cannot overlap. An additional check is also added to validate_change() to make sure that cpuset.cpus of one cpuset cannot be a subset of cpuset.cpus.exclusive of a sibling cpuset to avoid the problem that none of those CPUs will be available when these exclusive CPUs are extracted out to a newly enabled partition root. The Documentation/admin-guide/cgroup-v2.rst file is updated to document the new constraints. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
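The two sibling constraints described above can be written down as a small check over sets. This is a hedged sketch, not the body of validate_change(): the function name, the dictionary layout, and the returned messages are hypothetical, and treating an empty cpuset.cpus as never conflicting is an assumption of this example.

```python
# Sketch of the two sibling constraints, with CPU lists modelled as sets.
# siblings_conflict() and its messages are illustrative, not kernel code.


def siblings_conflict(a, b):
    """Return a reason string if siblings a and b conflict, else None.

    Each sibling is a dict with 'cpus' (cpuset.cpus) and 'exclusive'
    (cpuset.cpus.exclusive), both given as sets of CPU numbers.
    """
    # Constraint 1: exclusive CPU lists of siblings must not overlap.
    if a["exclusive"] & b["exclusive"]:
        return "exclusive CPUs overlap"
    # Constraint 2: one sibling's cpus must not be entirely covered by the
    # other's exclusive CPUs, or it would be left with no usable CPU.
    # Assumption: an empty 'cpus' set is treated as "nothing to starve".
    for x, y in ((a, b), (b, a)):
        if x["cpus"] and x["cpus"] <= y["exclusive"]:
            return "cpus fully covered by a sibling's exclusive CPUs"
    return None
```

A partial overlap between one sibling's cpus and the other's exclusive list passes this check, because the first sibling would still keep some CPUs once the exclusive ones are taken out.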
2024-06-19 | cgroup/cpuset: Fix remote root partition creation problem | Waiman Long
Since commit 181c8e091aae ("cgroup/cpuset: Introduce remote partition"), a remote partition can be created underneath a non-partition root cpuset as long as its exclusive_cpus are set to distribute exclusive CPUs down to its children. The generate_sched_domains() function, however, doesn't take into account this new behavior and hence will fail to create the sched domain needed for a remote root (non-isolated) partition. There are two issues related to remote partition support. First of all, generate_sched_domains() has a fast path that is activated if root_load_balance is true and top_cpuset.nr_subparts is non-zero. The latter condition isn't quite correct for remote partitions as nr_subparts just shows the number of local child partitions underneath it. There can be no local child partition under top_cpuset even if there are remote partitions further down the hierarchy. Fix that by checking for subpartitions_cpus which contains exclusive CPUs allocated to both local and remote partitions. Secondly, the valid partition check for subtree skipping in the csa[] generation loop isn't enough as a remote partition does not need to have a partition root parent. Fix this problem by breaking the csa[] array generation loop of generate_sched_domains() into v1 and v2 specific parts and checking a cpuset's exclusive_cpus before skipping its subtree in the v2 case. Also simplify generate_sched_domains() for cgroup v2 as only non-isolating partition roots should be included in building the cpuset array and none of the v1 scheduling attributes other than a different way to create an isolated partition are supported. Fixes: 181c8e091aae ("cgroup/cpuset: Introduce remote partition") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-03cgroup/cpuset: Optimize isolated partition only generate_sched_domains() callsWaiman Long
If only isolated partitions are being created underneath the cgroup root, there will only be one sched domain with top_cpuset.effective_cpus. We can skip the unnecessary sched domains scanning code and save some cycles. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-01cgroup/cpuset: Reduce the lock protecting CS_SCHED_LOAD_BALANCEXiu Jianfeng
In cpuset_css_online(), clearing the CS_SCHED_LOAD_BALANCE bit of cs->flags is guarded by callback_lock and cpuset_mutex. This is not a problem in itself, because it is consistent with the description of these two global locks at the beginning of this file. However, since the operation of checking, setting and clearing the flag bit is atomic, protection of callback_lock is unnecessary here, see CS_SPREAD_*. So, to make it more consistent with the other code, move the operation outside the critical section of callback_lock. No functional changes intended. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-05-26cgroup/cpuset: Update comment on callback_lockXiu Jianfeng
Since commit 51ffe41178c4 ("cpuset: convert away from cftype->read()"), cpuset_common_file_read() has been renamed. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-05-26cgroup/cpuset: Remove unnecessary zeroingXiu Jianfeng
The struct cpuset is kzalloc'd, so all of its members are already zeroed and there is no need for nodes_clear() here. No functional changes intended. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-05-19Merge tag 'sched-urgent-2024-05-18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
- Fix a sched_balance_newidle setting bug
- Fix bug in the setting of /sys/fs/cgroup/test/cpu.max.burst
- Fix variable-shadowing build warning
- Extend sched-domains debug output
- Fix documentation
- Fix comments
* tag 'sched-urgent-2024-05-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix incorrect initialization of the 'burst' parameter in cpu_max_write()
  sched/fair: Remove stale FREQUENCY_UTIL comment
  sched/fair: Fix initial util_avg calculation
  docs: cgroup-v1: Clarify that domain levels are system-specific
  sched/debug: Dump domains' level
  sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level
  arch/topology: Fix variable naming to avoid shadowing
2024-05-17sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_levelVitalii Bursov
Change relax_domain_level checks so that it would be possible to include or exclude all domains from newidle balancing. This matches the behavior described in the documentation:
  -1   no request. use system default or follow request of others.
   0   no search.
   1   search siblings (hyperthreads in a core).
"2" enables levels 0 and 1, level_max excludes the last (level_max) level, and level_max+1 includes all levels.
Fixes: 1d3504fcf560 ("sched, cpuset: customize sched domains, core") Signed-off-by: Vitalii Bursov <vitaly@bursov.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com
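The level semantics described above can be written down as two small predicates. This is a sketch of the documented behavior only, not the scheduler code: the function names are hypothetical, and the handling of -1 (fall back to the system default) is left to the caller.

```c
#include <stdbool.h>

/*
 * A domain at 'level' takes part in newidle balancing only if it lies
 * strictly below the requested level. 'request' is assumed to be >= 0;
 * -1 ("no request") must be resolved by the caller beforehand.
 */
static bool newidle_enabled(int level, int request)
{
	return level < request;
}

/*
 * Range of values accepted for the request when the highest domain
 * level is 'level_max': -1 (no request) up to level_max + 1 (all levels).
 */
static bool request_valid(int request, int level_max)
{
	return request >= -1 && request <= level_max + 1;
}
```

With these predicates, a request of 0 disables every level, a request equal to level_max leaves only the last level out, and level_max + 1 enables all of them.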
2024-04-25cgroup/cpuset: Remove outdated comment in sched_partition_write()Xiu Jianfeng
The comment here is outdated and can cause confusion. From the code's perspective, there is also no need for a new comment, so just remove it. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-23cgroup/cpuset: Fix incorrect top_cpuset flagsWaiman Long
Commit 8996f93fc388 ("cgroup/cpuset: Statically initialize more members of top_cpuset") uses an incorrect "<" relational operator for the CS_SCHED_LOAD_BALANCE bit when initializing the top_cpuset. This results in load balancing being turned off by default in the top cpuset, which is bad for performance. Fix this by using the BIT() helper macro to set the desired top_cpuset flags and prevent similar mistakes from being made in the future. Fixes: 8996f93fc388 ("cgroup/cpuset: Statically initialize more members of top_cpuset") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
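The effect of writing "<" where "<<" was meant can be reproduced in user space. The sketch below is illustrative only: the bit position given to CS_SCHED_LOAD_BALANCE and the local BIT() definition are assumptions for the example, not the kernel's definitions.

```c
#include <stdbool.h>

/* Assumed bit position, for illustration only; the kernel's enum may differ. */
enum { CS_SCHED_LOAD_BALANCE = 5 };

/* Local stand-in for the kernel's BIT() helper. */
#define BIT(nr) (1UL << (nr))

/*
 * Relational "<" instead of shift "<<": the expression is a comparison,
 * so it evaluates to 0 or 1 and never sets the intended bit.
 */
static const unsigned long flags_typo = (1 < CS_SCHED_LOAD_BALANCE);

/* Intended initializer, written with the helper so the typo cannot recur. */
static const unsigned long flags_fixed = BIT(CS_SCHED_LOAD_BALANCE);

static bool load_balance_enabled(unsigned long flags)
{
	return (flags & BIT(CS_SCHED_LOAD_BALANCE)) != 0;
}
```

Because the mistyped expression is still a valid constant expression, the compiler accepts it silently, which is why spelling the flag with BIT() is the more robust form.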
2024-04-23cgroup/cpuset: Avoid clearing CS_SCHED_LOAD_BALANCE twiceXiu Jianfeng
In cpuset_css_online(), CS_SCHED_LOAD_BALANCE can be cleared twice. The first clearing, in the is_in_v2_mode() case, can be removed: is_in_v2_mode() can be true for cgroup v1 if the "cpuset_v2_mode" mount option is specified, and that balance flag change isn't appropriate for this particular case. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-22cgroup/cpuset: Statically initialize more members of top_cpusetXiu Jianfeng
Initializing top_cpuset.relax_domain_level and setting CS_SCHED_LOAD_BALANCE in top_cpuset.flags, currently done in cpuset_init(), can instead be done by the compiler at the point where top_cpuset is defined. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-08cgroup/cpuset: Make cpuset hotplug processing synchronousWaiman Long
Since commit 3a5a6d0c2b03 ("cpuset: don't nest cgroup_mutex inside get_online_cpus()"), cpuset hotplug was done asynchronously via a work function. This is to avoid recursive locking of cgroup_mutex. Since then, the cgroup locking scheme has changed quite a bit. A cpuset_mutex was introduced to protect cpuset specific operations. The cpuset_mutex was then replaced by a cpuset_rwsem. With commit d74b27d63a8b ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock order"), cpu_hotplug_lock is acquired before cpuset_rwsem. Later on, cpuset_rwsem was reverted back to cpuset_mutex. All these locking changes allow the hotplug code to call into cpuset core directly. The following commits were also merged due to the asynchronous nature of cpuset hotplug processing.
- commit b22afcdf04c9 ("cpu/hotplug: Cure the cpusets trainwreck")
- commit 50e76632339d ("sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs")
- commit 28b89b9e6f7b ("cpuset: handle race between CPU hotplug and cpuset_hotplug_work")
Clean up all these bandages by making cpuset hotplug processing synchronous again with the exception that the call to cgroup_transfer_tasks() to transfer tasks out of an empty cgroup v1 cpuset, if necessary, will still be done via a work function due to the existing cgroup_mutex -> cpu_hotplug_lock dependency. It is possible to reverse that dependency, but that will require updating a number of different cgroup controllers. This special hotplug code path should be rarely taken anyway. As all the cpuset states will be updated by the end of the hotplug operation, we can revert most of the above commits except commit 50e76632339d ("sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs") which is partially reverted. Also remove some cpus_read_lock trylock attempts in the cpuset partition code as they are no longer necessary since the cpu_hotplug_lock is now held for the whole duration of the cpuset hotplug code path.
Signed-off-by: Waiman Long <longman@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-03-11Merge tag 'cgroup-for-6.9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"A quiet cycle. One trivial doc update patch. Two patches to drop the now defunct memory_spread_slab feature from cgroup1 cpuset"
* tag 'cgroup-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Mark memory_spread_slab as obsolete
  cgroup/cpuset: Remove cpuset_do_slab_mem_spread()
  docs: cgroup-v1: add missing code-block tags
2024-02-29cgroup/cpuset: Fix retval in update_cpumask()Kamalesh Babulal
update_cpumask() checks the newly requested cpumask by calling validate_change(), which returns an error when passed an invalid set of CPUs. Regardless of the error returned, update_cpumask() always returns zero, suppressing the error and reporting success to the user who wrote an invalid CPU range to a cpuset. Fix it by returning retval instead, which holds the value returned by validate_change(). Fixes: 99fe36ba6fc1 ("cgroup/cpuset: Improve temporary cpumasks handling") Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Tejun Heo <tj@kernel.org>
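The shape of this bug, an error code that is computed but then dropped on the common exit path, can be shown with a small user-space sketch. The functions below are hypothetical stand-ins: validate_mask() merely rejects an empty mask and does not reflect what validate_change() actually checks, and the real update_cpumask() has a different signature.

```c
#include <errno.h>

/* Hypothetical stand-in for validate_change(): rejects an empty mask. */
static int validate_mask(unsigned long mask)
{
	return mask ? 0 : -EINVAL;
}

/* Buggy shape: the error is detected, but the exit path returns 0. */
static int update_mask_buggy(unsigned long *cur, unsigned long new_mask)
{
	int retval = validate_mask(new_mask);

	if (retval < 0)
		goto out;
	*cur = new_mask;
out:
	return 0;	/* error from validate_mask() is lost here */
}

/* Fixed shape: the exit path propagates retval to the caller. */
static int update_mask_fixed(unsigned long *cur, unsigned long new_mask)
{
	int retval = validate_mask(new_mask);

	if (retval < 0)
		goto out;
	*cur = new_mask;
out:
	return retval;
}
```

In both variants the invalid mask is never applied; the difference is only whether the caller learns that the write was rejected.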