summaryrefslogtreecommitdiff
path: root/kernel/cgroup/cgroup-v1.c
AgeCommit message (Collapse)Author
2024-01-18Merge tag 'driver-core-6.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here are the set of driver core and kernfs changes for 6.8-rc1. Nothing major in here this release cycle, just lots of small cleanups and some tweaks on kernfs that in the very end, got reverted and will come back in a safer way next release cycle. Included in here are: - more driver core 'const' cleanups and fixes - fw_devlink=rpm is now the default behavior - kernfs tiny changes to remove some string functions - cpu handling in the driver core is updated to work better on many systems that add topologies and cpus after booting - other minor changes and cleanups All of the cpu handling patches have been acked by the respective maintainers and are coming in here in one series. Everything has been in linux-next for a while with no reported issues" * tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (51 commits) Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock" kernfs: convert kernfs_idr_lock to an irq safe raw spinlock class: fix use-after-free in class_register() PM: clk: make pm_clk_add_notifier() take a const pointer EDAC: constantify the struct bus_type usage kernfs: fix reference to renamed function driver core: device.h: fix Excess kernel-doc description warning driver core: class: fix Excess kernel-doc description warning driver core: mark remaining local bus_type variables as const driver core: container: make container_subsys const driver core: bus: constantify subsys_register() calls driver core: bus: make bus_sort_breadthfirst() take a const pointer kernfs: d_obtain_alias(NULL) will do the right thing... driver core: Better advertise dev_err_probe() kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy() kernfs: Convert kernfs_name_locked() from strlcpy() to strscpy() kernfs: Convert kernfs_walk_ns() from strlcpy() to strscpy() initramfs: Expose retained initrd as sysfs file fs/kernfs/dir: obey S_ISGID kernel/cgroup: use kernfs_create_dir_ns() ...
2023-12-15kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()Kees Cook
One of the last remaining users of strlcpy() in the kernel is kernfs_path_from_node_locked(), which passes back the problematic "length we _would_ have copied" return value to indicate truncation. Convert the chain of all callers to use the negative return value (some of which already doing this explicitly). All callers were already also checking for negative return values, so the risk to missed checks looks very low. In this analysis, it was found that cgroup1_release_agent() actually didn't handle the "too large" condition, so this is technically also a bug fix. :) Here's the chain of callers, and resolution identifying each one as now handling the correct return value: kernfs_path_from_node_locked() kernfs_path_from_node() pr_cont_kernfs_path() returns void kernfs_path() sysfs_warn_dup() return value ignored cgroup_path() blkg_path() bfq_bic_update_cgroup() return value ignored TRACE_IOCG_PATH() return value ignored TRACE_CGROUP_PATH() return value ignored perf_event_cgroup() return value ignored task_group_path() return value ignored damon_sysfs_memcg_path_eq() return value ignored get_mm_memcg_path() return value ignored lru_gen_seq_show() return value ignored cgroup_path_from_kernfs_id() return value ignored cgroup_show_path() already converted "too large" error to negative value cgroup_path_ns_locked() cgroup_path_ns() bpf_iter_cgroup_show_fdinfo() return value ignored cgroup1_release_agent() wasn't checking "too large" error proc_cgroup_show() already converted "too large" to negative value Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Waiman Long <longman@redhat.com> Cc: <cgroups@vger.kernel.org> Co-developed-by: Azeem Shaikh <azeemshaikh38@gmail.com> Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Link: https://lore.kernel.org/r/20231116192127.1558276-3-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20231212211741.164376-3-keescook@chromium.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-09cgroup: Add a new helper for cgroup1 hierarchyYafang Shao
A new helper is added for cgroup1 hierarchy: - task_get_cgroup1 Acquires the associated cgroup of a task within a specific cgroup1 hierarchy. The cgroup1 hierarchy is identified by its hierarchy ID. This helper function is added to facilitate the tracing of tasks within a particular container or cgroup dir in BPF programs. It's important to note that this helper is designed specifically for cgroup1 only. tj: Use irsqsave/restore as suggested by Hou Tao <houtao@huaweicloud.com>. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Hou Tao <houtao@huaweicloud.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-10-09cgroup: Remove duplicates in cgroup v1 tasks fileMichal Koutný
One PID may appear multiple times in a preloaded pidlist. (Possibly due to PID recycling but we have reports of the same task_struct appearing with different PIDs, thus possibly involving transfer of PID via de_thread().) Because v1 seq_file iterator uses PIDs as position, it leads to a message: > seq_file: buggy .next function kernfs_seq_next did not update position index Conservative and quick fix consists of removing duplicates from `tasks` file (as opposed to removing pidlists altogether). It doesn't affect correctness (it's sufficient to show a PID once), performance impact would be hidden by unconditional sorting of the pidlist already in place (asymptotically). Link: https://lore.kernel.org/r/20230823174804.23632-1-mkoutny@suse.com/ Suggested-by: Firo Yang <firo.yang@suse.com> Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2023-08-07cgroup: clean up if condition in cgroup_pidlist_start()Miaohe Lin
There's no need to use '<=' when knowing 'l->list[mid] != pid' already. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-27Merge tag 'cgroup-for-6.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Whenever cpuset needs to rebuild sched_domain, it walked all tasks looking for DEADLINE tasks as they need to be accounted on the new domain. Walking all tasks can be expensive and there may not be any DEADLINE tasks at all. Task iteration is now omitted if there are no DEADLINE tasks - Fixes DEADLINE bandwidth misaccounting after task migration failures - When no controller is enabled, -Wstringop-overflow warning is triggered. The fix patch added an early exit which is too eager and got reverted for now. Will fix later - Everything else is minor cleanups * tag 'cgroup-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: Revert "cgroup: Avoid -Wstringop-overflow warnings" cgroup/misc: Expose misc.current on cgroup v2 root cgroup: Avoid -Wstringop-overflow warnings cgroup: remove obsolete comment on cgroup_on_dfl() cgroup: remove unused task_cgroup_path() cgroup/cpuset: remove unneeded header files cgroup: make cgroup_is_threaded() and cgroup_is_thread_root() static rdmacg: fix kernel-doc warnings in rdmacg cgroup: Replace the css_set call with cgroup_get cgroup: remove unused macro for_each_e_css() cgroup: Update out-of-date comment in cgroup_migrate() cgroup: Replace all non-returning strlcpy with strscpy cgroup/cpuset: remove unneeded header files cgroup/cpuset: Free DL BW in case can_attach() fails sched/deadline: Create DL BW alloc, free & check overflow interface cgroup/cpuset: Iterate only if DEADLINE tasks are present sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets sched/cpuset: Bring back cpuset_mutex cgroup/cpuset: Rename functions dealing with DEADLINE accounting
2023-05-22cgroup: Replace all non-returning strlcpy with strscpyAzeem Shaikh
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-22cgroup: fix missing cpus_read_{lock,unlock}() in cgroup_transfer_tasks()Qi Zheng
The commit 4f7e7236435c ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") fixed the deadlock between cgroup_threadgroup_rwsem and cpus_read_lock() by introducing cgroup_attach_{lock,unlock}() and removing cpus_read_{lock,unlock}() from cpuset_attach(). But cgroup_transfer_tasks() was missed and not handled, which will cause th following warning: WARNING: CPU: 0 PID: 589 at kernel/cpu.c:526 lockdep_assert_cpus_held+0x32/0x40 CPU: 0 PID: 589 Comm: kworker/1:4 Not tainted 6.4.0-rc2-next-20230517 #50 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Workqueue: events cpuset_hotplug_workfn RIP: 0010:lockdep_assert_cpus_held+0x32/0x40 <...> Call Trace: <TASK> cpuset_attach+0x40/0x240 cgroup_migrate_execute+0x452/0x5e0 ? _raw_spin_unlock_irq+0x28/0x40 cgroup_transfer_tasks+0x1f3/0x360 ? find_held_lock+0x32/0x90 ? cpuset_hotplug_workfn+0xc81/0xed0 cpuset_hotplug_workfn+0xcb1/0xed0 ? process_one_work+0x248/0x5b0 process_one_work+0x2b9/0x5b0 worker_thread+0x56/0x3b0 ? process_one_work+0x5b0/0x5b0 kthread+0xf1/0x120 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> So just use the cgroup_attach_{lock,unlock}() helper to fix it. Reported-by: Zhao Gongyi <zhaogongyi@bytedance.com> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Fixes: 05c7b7a92cc8 ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug") Cc: stable@vger.kernel.org # v5.17+ Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-17cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappersKamalesh Babulal
Replace mutex_[un]lock() with cgroup_[un]lock() wrappers to stay consistent across cgroup core and other subsystem code, while operating on the cgroup_mutex. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-26cgroup: Use cgroup_attach_{lock,unlock}() from cgroup_attach_task_all()Tetsuo Handa
No behavior changes; preparing for potential locking changes in future. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by:Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-25cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()Tetsuo Handa
syzbot is hitting percpu_rwsem_assert_held(&cpu_hotplug_lock) warning at cpuset_attach() [1], for commit 4f7e7236435ca0ab ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") missed that cpuset_attach() is also called from cgroup_attach_task_all(). Add cpus_read_lock() like what cgroup_procs_write_start() does. Link: https://syzkaller.appspot.com/bug?extid=29d3a3b4d86c8136ad9e [1] Reported-by: syzbot <syzbot+29d3a3b4d86c8136ad9e@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: 4f7e7236435ca0ab ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") Signed-off-by: Tejun Heo <tj@kernel.org>
2022-07-23cgroup: Make !percpu threadgroup_rwsem operations optionalTejun Heo
3942a9bd7b58 ("locking, rcu, cgroup: Avoid synchronize_sched() in __cgroup_procs_write()") disabled percpu operations on threadgroup_rwsem because the impiled synchronize_rcu() on write locking was pushing up the latencies too much for android which constantly moves processes between cgroups. This makes the hotter paths - fork and exit - slower as they're always forced into the slow path. There is no reason to force this on everyone especially given that more common static usage pattern can now completely avoid write-locking the rwsem. Write-locking is elided when turning on and off controllers on empty sub-trees and CLONE_INTO_CGROUP enables seeding a cgroup without grabbing the rwsem. Restore the default percpu operations and introduce the mount option "favordynmods" and config option CGROUP_FAVOR_DYNMODS for users who need lower latencies for the dynamic operations. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Michal Koutn� <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Dmitry Shmidt <dimitrysh@google.com> Cc: Oleg Nesterov <oleg@redhat.com>
2022-02-22cgroup-v1: Correct privileges check in release_agent writesMichal Koutný
The idea is to check: a) the owning user_ns of cgroup_ns, b) capabilities in init_user_ns. The commit 24f600856418 ("cgroup-v1: Require capabilities to set release_agent") got this wrong in the write handler of release_agent since it checked user_ns of the opener (may be different from the owning user_ns of cgroup_ns). Secondly, to avoid possibly confused deputy, the capability of the opener must be checked. Fixes: 24f600856418 ("cgroup-v1: Require capabilities to set release_agent") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/stable/20220216121142.GB30035@blackbody.suse.cz/ Signed-off-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Masami Ichikawa(CIP) <masami.ichikawa@cybertrust.co.jp> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-01cgroup-v1: Require capabilities to set release_agentEric W. Biederman
The cgroup release_agent is called with call_usermodehelper. The function call_usermodehelper starts the release_agent with a full set fo capabilities. Therefore require capabilities when setting the release_agaent. Reported-by: Tabitha Sable <tabitha.c.sable@gmail.com> Tested-by: Tabitha Sable <tabitha.c.sable@gmail.com> Fixes: 81a6a5cdd2c5 ("Task Control Groups: automatic userspace notification of idle cgroups") Cc: stable@vger.kernel.org # v2.6.24+ Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-06cgroup: Allocate cgroup_file_ctx for kernfs_open_file->privTejun Heo
of->priv is currently used by each interface file implementation to store private information. This patch collects the current two private data usages into struct cgroup_file_ctx which is allocated and freed by the common path. This allows generic private data which applies to multiple files, which will be used to in the following patch. Note that cgroup_procs iterator is now embedded as procs.iter in the new cgroup_file_ctx so that it doesn't need to be allocated and freed separately. v2: union dropped from cgroup_file_ctx and the procs iterator is embedded in cgroup_file_ctx as suggested by Linus. v3: Michal pointed out that cgroup1's procs pidlist uses of->priv too. Converted. Didn't change to embedded allocation as cgroup1 pidlists get stored for caching. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Michal Koutný <mkoutny@suse.com>
2022-01-06cgroup: Use open-time credentials for process migraton perm checksTejun Heo
cgroup process migration permission checks are performed at write time as whether a given operation is allowed or not is dependent on the content of the write - the PID. This currently uses current's credentials which is a potential security weakness as it may allow scenarios where a less privileged process tricks a more privileged one into writing into a fd that it created. This patch makes both cgroup2 and cgroup1 process migration interfaces to use the credentials saved at the time of open (file->f_cred) instead of current's. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Fixes: 187fe84067bd ("cgroup: require write perm on common ancestor when moving processes on the default hierarchy") Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-10-25cgroup: no need for cgroup_mutex for /proc/cgroupsShakeel Butt
On the real systems, the cgroups hierarchies are setup early and just once by the node controller, so, other than number of cgroups, all information in /proc/cgroups remain same for the system uptime. Let's remove the cgroup_mutex usage on reading /proc/cgroups. There is a chance of inconsistent number of cgroups for co-mounted cgroups while printing the information from /proc/cgroups but that is not a big issue. In addition /proc/cgroups is a v1 specific interface, so the dependency on it should reduce over time. The main motivation for removing the cgroup_mutex from /proc/cgroups is to reduce the avenues of its contention. On our fleet, we have observed buggy application hammering on /proc/cgroups and drastically slowing down the node controller on the system which have many negative consequences on other workloads running on the system. Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-10-25cgroup: remove cgroup_mutex from cgroupstats_buildShakeel Butt
The function cgroupstats_build extracts cgroup from the kernfs_node's priv pointer which is a RCU pointer. So, there is no need to grab cgroup_mutex. Just get the reference on the cgroup before using and remove the cgroup_mutex altogether. Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-10-05cgroup: cgroup-v1: do not exclude cgrp_dfl_rootVishal Verma
Found an issue within cgroup_attach_task_all() fn which seem to exclude cgrp_dfl_root (cgroupv2) while attaching tasks to the given cgroup. This was noticed when the system was running qemu/kvm with kernel vhost helper threads. It appears that the vhost layer which uses cgroup_attach_task_all() fn to assign the vhost kthread to the right qemu cgroup works fine with cgroupv1 based configuration but not in cgroupv2. With cgroupv2, the vhost helper thread ends up just belonging to the root cgroup as is shown below: $ stat -fc %T /sys/fs/cgroup/ cgroup2fs $ sudo pgrep qemu 1916421 $ ps -eL | grep 1916421 1916421 1916421 ? 00:00:01 qemu-system-x86 1916421 1916431 ? 00:00:00 call_rcu 1916421 1916435 ? 00:00:00 IO mon_iothread 1916421 1916436 ? 00:00:34 CPU 0/KVM 1916421 1916439 ? 00:00:00 SPICE Worker 1916421 1916440 ? 00:00:00 vnc_worker 1916433 1916433 ? 00:00:00 vhost-1916421 1916437 1916437 ? 00:00:00 kvm-pit/1916421 $ cat /proc/1916421/cgroup 0::/machine.slice/machine-qemu\x2d18\x2dDroplet\x2d7572850.scope/emulator $ cat /proc/1916439/cgroup 0::/machine.slice/machine-qemu\x2d18\x2dDroplet\x2d7572850.scope/emulator $ cat /proc/1916433/cgroup 0::/ From above, it can be seen that the vhost kthread (PID: 1916433) doesn't seem to belong the qemu cgroup like other qemu PIDs. After applying this patch: $ pgrep qemu 1643 $ ps -eL | grep 1643 1643 1643 ? 00:00:00 qemu-system-x86 1643 1645 ? 00:00:00 call_rcu 1643 1648 ? 00:00:00 IO mon_iothread 1643 1649 ? 00:00:00 CPU 0/KVM 1643 1652 ? 00:00:00 SPICE Worker 1643 1653 ? 00:00:00 vnc_worker 1647 1647 ? 00:00:00 vhost-1643 1651 1651 ? 00:00:00 kvm-pit/1643 $ cat /proc/1647/cgroup 0::/machine.slice/machine-qemu\x2d18\x2dDroplet\x2d7572850.scope/emulator Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-31Merge branch 'for-5.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Two cpuset behavior changes: - cpuset on cgroup2 is changed to enable memory migration based on nodemask by default. - A notification is generated when cpuset partition state changes. All other patches are minor fixes and cleanups" * 'for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Avoid compiler warnings with no subsystems cgroup/cpuset: Avoid memory migration when nodemasks match cgroup/cpuset: Enable memory migration for cpuset v2 cgroup/cpuset: Enable event notification when partition state changes cgroup: cgroup-v1: clean up kernel-doc notation cgroup: Replace deprecated CPU-hotplug functions. cgroup/cpuset: Fix violation of cpuset locking rule cgroup/cpuset: Fix a partition bug with hotplug cgroup/cpuset: Miscellaneous code cleanup cgroup: remove cgroup_mount from comments
2021-08-11cgroup: cgroup-v1: clean up kernel-doc notationRandy Dunlap
Fix kernel-doc warnings found in cgroup-v1.c: kernel/cgroup/cgroup-v1.c:55: warning: No description found for return value of 'cgroup_attach_task_all' kernel/cgroup/cgroup-v1.c:94: warning: expecting prototype for cgroup_trasnsfer_tasks(). Prototype was for cgroup_transfer_tasks() instead cgroup-v1.c:96: warning: No description found for return value of 'cgroup_transfer_tasks' kernel/cgroup/cgroup-v1.c:687: warning: No description found for return value of 'cgroupstats_build' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: cgroups@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
2021-07-21cgroup1: fix leaked context root causing sporadic NULL deref in LTPPaul Gortmaker
Richard reported sporadic (roughly one in 10 or so) null dereferences and other strange behaviour for a set of automated LTP tests. Things like: BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014 RIP: 0010:kernfs_sop_show_path+0x1b/0x60 ...or these others: RIP: 0010:do_mkdirat+0x6a/0xf0 RIP: 0010:d_alloc_parallel+0x98/0x510 RIP: 0010:do_readlinkat+0x86/0x120 There were other less common instances of some kind of a general scribble but the common theme was mount and cgroup and a dubious dentry triggering the NULL dereference. I was only able to reproduce it under qemu by replicating Richard's setup as closely as possible - I never did get it to happen on bare metal, even while keeping everything else the same. In commit 71d883c37e8d ("cgroup_do_mount(): massage calling conventions") we see this as a part of the overall change: -------------- struct cgroup_subsys *ss; - struct dentry *dentry; [...] - dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root, - CGROUP_SUPER_MAGIC, ns); [...] - if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) { - struct super_block *sb = dentry->d_sb; - dput(dentry); + ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns); + if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) { + struct super_block *sb = fc->root->d_sb; + dput(fc->root); deactivate_locked_super(sb); msleep(10); return restart_syscall(); } -------------- In changing from the local "*dentry" variable to using fc->root, we now export/leave that dentry pointer in the file context after doing the dput() in the unlikely "is_dying" case. With LTP doing a crazy amount of back to back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely becomes slightly likely and then bad things happen. A fix would be to not leave the stale reference in fc->root as follows: --------------                 dput(fc->root); + fc->root = NULL;                 deactivate_locked_super(sb); -------------- ...but then we are just open-coding a duplicate of fc_drop_locked() so we simply use that instead. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: stable@vger.kernel.org # v5.1+ Reported-by: Richard Purdie <richard.purdie@linuxfoundation.org> Fixes: 71d883c37e8d ("cgroup_do_mount(): massage calling conventions") Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-07-14fs: add vfs_parse_fs_param_source() helperChristian Brauner
Add a simple helper that filesystems can use in their parameter parser to parse the "source" parameter. A few places open-coded this function and that already caused a bug in the cgroup v1 parser that we fixed. Let's make it harder to get this wrong by introducing a helper which performs all necessary checks. Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b12 Cc: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-14cgroup: verify that source is a stringChristian Brauner
The following sequence can be used to trigger a UAF: int fscontext_fd = fsopen("cgroup"); int fd_null = open("/dev/null, O_RDONLY); int fsconfig(fscontext_fd, FSCONFIG_SET_FD, "source", fd_null); close_range(3, ~0U, 0); The cgroup v1 specific fs parser expects a string for the "source" parameter. However, it is perfectly legitimate to e.g. specify a file descriptor for the "source" parameter. The fs parser doesn't know what a filesystem allows there. So it's a bug to assume that "source" is always of type fs_value_is_string when it can reasonably also be fs_value_is_file. This assumption in the cgroup code causes a UAF because struct fs_parameter uses a union for the actual value. Access to that union is guarded by the param->type member. Since the cgroup paramter parser didn't check param->type but unconditionally moved param->string into fc->source a close on the fscontext_fd would trigger a UAF during put_fs_context() which frees fc->source thereby freeing the file stashed in param->file causing a UAF during a close of the fd_null. Fix this by verifying that param->type is actually a string and report an error if not. In follow up patches I'll add a new generic helper that can be used here and by other filesystems instead of this error-prone copy-pasta fix. But fixing it in here first makes backporting a it to stable a lot easier. Fixes: 8d2451f4994f ("cgroup1: switch to option-by-option parsing") Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com Cc: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: <stable@kernel.org> Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-18sched: Change task_struct::statePeter Zijlstra
Change the type and name of task_struct::state. Drop the volatile and shrink it to an 'unsigned int'. Rename it in order to find all uses such that we can use READ_ONCE/WRITE_ONCE as appropriate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-10cgroup1: don't allow '\n' in renamingAlexander Kuznetsov
cgroup_mkdir() have restriction on newline usage in names: $ mkdir $'/sys/fs/cgroup/cpu/test\ntest2' mkdir: cannot create directory '/sys/fs/cgroup/cpu/test\ntest2': Invalid argument But in cgroup1_rename() such check is missed. This allows us to make /proc/<pid>/cgroup unparsable: $ mkdir /sys/fs/cgroup/cpu/test $ mv /sys/fs/cgroup/cpu/test $'/sys/fs/cgroup/cpu/test\ntest2' $ echo $$ > $'/sys/fs/cgroup/cpu/test\ntest2' $ cat /proc/self/cgroup 11:pids:/ 10:freezer:/ 9:hugetlb:/ 8:cpuset:/ 7:blkio:/user.slice 6:memory:/user.slice 5:net_cls,net_prio:/ 4:perf_event:/ 3:devices:/user.slice 2:cpu,cpuacct:/test test2 1:name=systemd:/ 0::/ Signed-off-by: Alexander Kuznetsov <wwfq@yandex-team.ru> Reported-by: Andrey Krasichkov <buglloc@yandex-team.ru> Acked-by: Dmitry Yakunin <zeil@yandex-team.ru> Cc: stable@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
2021-05-24cgroup: fix spelling mistakesZhen Lei
Fix some spelling mistakes in comments: hierarhcy ==> hierarchy automtically ==> automatically overriden ==> overridden In absense of .. or ==> In absence of .. and assocaited ==> associated taget ==> target initate ==> initiate succeded ==> succeeded curremt ==> current udpated ==> updated Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-04-16cgroup: use tsk->in_iowait instead of delayacct_is_task_waiting_on_io()Chunguang Xu
If delayacct is disabled, then delayacct_is_task_waiting_on_io() always returns false, which causes the statistical value to be wrong. Perhaps tsk->in_iowait is better. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-01-15cgroup-v1: add disabled controller check in cgroup1_parse_param()Chen Zhou
When mounting a cgroup hierarchy with disabled controller in cgroup v1, all available controllers will be attached. For example, boot with cgroup_no_v1=cpu or cgroup_disable=cpu, and then mount with "mount -t cgroup -ocpu cpu /sys/fs/cgroup/cpu", then all enabled controllers will be attached except cpu. Fix this by adding disabled controller check in cgroup1_parse_param(). If the specified controller is disabled, just return error with information "Disabled controller xx" rather than attaching all the other enabled controllers. Fixes: f5dfb5315d34 ("cgroup: take options parsing into ->parse_monolithic()") Signed-off-by: Chen Zhou <chenzhou10@huawei.com> Reviewed-by: Zefan Li <lizefan.x@bytedance.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2020-12-16cgroup: Fix memory leak when parsing multiple source parametersQinglang Miao
A memory leak is found in cgroup1_parse_param() when multiple source parameters overwrite fc->source in the fs_context struct without free. unreferenced object 0xffff888100d930e0 (size 16): comm "mount", pid 520, jiffies 4303326831 (age 152.783s) hex dump (first 16 bytes): 74 65 73 74 6c 65 61 6b 00 00 00 00 00 00 00 00 testleak........ backtrace: [<000000003e5023ec>] kmemdup_nul+0x2d/0xa0 [<00000000377dbdaa>] vfs_parse_fs_string+0xc0/0x150 [<00000000cb2b4882>] generic_parse_monolithic+0x15a/0x1d0 [<000000000f750198>] path_mount+0xee1/0x1820 [<0000000004756de2>] do_mount+0xea/0x100 [<0000000094cafb0a>] __x64_sys_mount+0x14b/0x1f0 Fix this bug by permitting a single source parameter and rejecting with an error all subsequent ones. Fixes: 8d2451f4994f ("cgroup1: switch to option-by-option parsing") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Reviewed-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2020-03-12cgroup: Restructure release_agent_path handlingTejun Heo
cgrp->root->release_agent_path is protected by both cgroup_mutex and release_agent_path_lock and readers can hold either one. The dual-locking scheme was introduced while breaking a locking dependency issue around cgroup_mutex but doesn't make sense anymore given that the only remaining reader which uses cgroup_mutex is cgroup1_releaes_agent(). This patch updates cgroup1_release_agent() to use release_agent_path_lock so that release_agent_path is always protected only by release_agent_path_lock. While at it, convert strlen() based empty string checks to direct tests on the first character as suggested by Linus. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-04cgroup1: don't call release_agent when it is ""Tycho Andersen
Older (and maybe current) versions of systemd set release_agent to "" when shutting down, but do not set notify_on_release to 0. Since 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()"), we filter out such calls when the user mode helper path is "". However, when used in conjunction with an actual (i.e. non "") STATIC_USERMODEHELPER, the path is never "", so the real usermode helper will be called with argv[0] == "". Let's avoid this by not invoking the release_agent when it is "". Signed-off-by: Tycho Andersen <tycho@tycho.ws> Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-12cgroup-v1: cgroup_pidlist_next should update position indexVasily Averin
if seq_file .next fuction does not change position index, read after some lseek can generate unexpected output. # mount | grep cgroup # dd if=/mnt/cgroup.procs bs=1 # normal output ... 1294 1295 1296 1304 1382 584+0 records in 584+0 records out 584 bytes copied dd: /mnt/cgroup.procs: cannot skip to specified offset 83 <<< generates end of last line 1383 <<< ... and whole last line once again 0+1 records in 0+1 records out 8 bytes copied dd: /mnt/cgroup.procs: cannot skip to specified offset 1386 <<< generates last line anyway 0+1 records in 0+1 records out 5 bytes copied https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-07cgroup1: switch to use of errorfc() et.al.Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07fs_parse: fold fs_parameter_desc/fs_parameter_specAl Viro
The former contains nothing but a pointer to an array of the latter... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07fs_parser: remove fs_parameter_description name fieldEric Sandeen
Unused now. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07get rid of cg_invalf()Al Viro
pointless alias for invalf()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-10-07cgroup: Optimize single thread migrationMichal Koutný
There are reports of users who use thread migrations between cgroups and they report performance drop after d59cfc09c32a ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem"). The effect is pronounced on machines with more CPUs. The migration is affected by forking noise happening in the background, after the mentioned commit a migrating thread must wait for all (forking) processes on the system, not only of its threadgroup. There are several places that need to synchronize with migration: a) do_exit, b) de_thread, c) copy_process, d) cgroup_update_dfl_csses, e) parallel migration (cgroup_{proc,thread}s_write). In the case of self-migrating thread, we relax the synchronization on cgroup_threadgroup_rwsem to avoid the cost of waiting. d) and e) are excluded with cgroup_mutex, c) does not matter in case of single thread migration and the executing thread cannot exec(2) or exit(2) while it is writing into cgroup.threads. In case of do_exit because of signal delivery, we either exit before the migration or finish the migration (of not yet PF_EXITING thread) and die afterwards. This patch handles only the case of self-migration by writing "0" into cgroup.threads. For simplicity, we always take cgroup_threadgroup_rwsem with numeric PIDs. This change improves migration dependent workload performance similar to per-signal_struct state. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2019-08-07Use kvmalloc in cgroups-v1Marc Koderer
Instead of using its own logic for k-/vmalloc rely on kvmalloc which is actually doing quite the same. Signed-off-by: Marc Koderer <marc@koderer.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2019-05-21treewide: Add SPDX license identifier for missed filesThomas Gleixner
Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-19cgroup: implement __cgroup_task_count() helperRoman Gushchin
The helper is identical to the existing cgroup_task_count() except it doesn't take the css_set_lock by itself, assuming that the caller does. Also, move cgroup_task_count() implementation into kernel/cgroup/cgroup.c, as there is nothing specific to cgroup v1. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com
2019-02-28vfs: Add some logging to the core users of the fs_context logDavid Howells
Add some logging to the core users of the fs_context log so that information can be extracted from them as to the reason for failure. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup: store a reference to cgroup_ns into cgroup_fs_contextAl Viro
... and trim cgroup_do_mount() arguments (renaming it to cgroup_do_get_tree()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helperAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup_do_mount(): massage calling conventionsAl Viro
pass it fs_context instead of fs_type/flags/root triple, have it return int instead of dentry and make it deal with setting fc->root. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup: stash cgroup_root reference into cgroup_fs_contextAl Viro
Note that this reference is *NOT* contributing to refcount of cgroup_root in question and is valid only until cgroup_do_mount() returns. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup1: switch to option-by-option parsingAl Viro
[dhowells should be the author - it's carved out of his patch] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup: take options parsing into ->parse_monolithic()Al Viro
Store the results in cgroup_fs_context. There's a nasty twist caused by the enabling/disabling subsystems - we can't do the checks sensitive to that until cgroup_mutex gets grabbed. Frankly, these checks are complete bullshit (e.g. all,none combination is accepted if all subsystems are disabled; so's cpusets,none and all,cpusets when cpusets is disabled, etc.), but touching that would be a userland-visible behaviour change ;-/ So we do parsing in ->parse_monolithic() and have the consistency checks done in check_cgroupfs_options(), with the latter called (on already parsed options) from cgroup1_get_tree() and cgroup1_reconfigure(). Freeing the strdup'ed strings is done from fs_context destructor, which somewhat simplifies the life for cgroup1_{get_tree,reconfigure}(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup: fold cgroup1_mount() into cgroup1_get_tree()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup: start switching to fs_contextAl Viro
Unfortunately, cgroup is tangled into kernfs infrastructure. To avoid converting all kernfs-based filesystems at once, we need to untangle the remount part of things, instead of having it go through kernfs_sop_remount_fs(). Fortunately, it's not hard to do. This commit just gets cgroup/cgroup1 to use fs_context to deliver options on mount and remount paths. Parsing those is going to be done in the next commits; for now we do pretty much what legacy case does. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>