summaryrefslogtreecommitdiff
path: root/kernel/cgroup
AgeCommit message (Collapse)Author
2022-05-05cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()Waiman Long
There are 3 places where the cpu and node masks of the top cpuset can be initialized in the order they are executed: 1) start_kernel -> cpuset_init() 2) start_kernel -> cgroup_init() -> cpuset_bind() 3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp() The first cpuset_init() call just sets all the bits in the masks. The second cpuset_bind() call sets cpus_allowed and mems_allowed to the default v2 values. The third cpuset_init_smp() call sets them back to v1 values. For systems with cgroup v2 setup, cpuset_bind() is called once. As a result, cpu and memory node hot add may fail to update the cpu and node masks of the top cpuset to include the newly added cpu or node in a cgroup v2 environment. For systems with cgroup v1 setup, cpuset_bind() is called again by rebind_subsystem() when the v1 cpuset filesystem is mounted as shown in the dmesg log below with an instrumented kernel. [ 2.609781] cpuset_bind() called - v2 = 1 [ 3.079473] cpuset_init_smp() called [ 7.103710] cpuset_bind() called - v2 = 0 smp_init() is called after the first two init functions. So we don't have a complete list of active cpus and memory nodes until later in cpuset_init_smp() which is the right time to set up effective_cpus and effective_mems. To fix this cgroup v2 mask setup problem, the potentially incorrect cpus_allowed & mems_allowed setting in cpuset_init_smp() are removed. For cgroup v2 systems, the initial cpuset_bind() call will set the masks correctly. For cgroup v1 systems, the second call to cpuset_bind() will do the right setup. cc: stable@vger.kernel.org Signed-off-by: Waiman Long <longman@redhat.com> Tested-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-03-28Merge tag 'driver-core-5.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the set of driver core changes for 5.18-rc1. Not much here, primarily it was a bunch of cleanups and small updates: - kobj_type cleanups for default_groups - documentation updates - firmware loader minor changes - component common helper added and take advantage of it in many drivers (the largest part of this pull request). All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (54 commits) Documentation: update stable review cycle documentation drivers/base/dd.c : Remove the initial value of the global variable Documentation: update stable tree link Documentation: add link to stable release candidate tree devres: fix typos in comments Documentation: add note block surrounding security patch note samples/kobject: Use sysfs_emit instead of sprintf base: soc: Make soc_device_match() simpler and easier to read driver core: dd: fix return value of __setup handler driver core: Refactor sysfs and drv/bus remove hooks driver core: Refactor multiple copies of device cleanup scripts: get_abi.pl: Fix typo in help message kernfs: fix typos in comments kernfs: remove unneeded #if 0 guard ALSA: hda/realtek: Make use of the helper component_compare_dev_name video: omapfb: dss: Make use of the helper component_compare_dev power: supply: ab8500: Make use of the helper component_compare_dev ASoC: codecs: wcd938x: Make use of the helper component_compare/release_of iommu/mediatek: Make use of the helper component_compare/release_of drm: of: Make use of the helper component_release_of ...
2022-03-24Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: "Various misc subsystems, before getting into the post-linux-next material. 41 patches. Subsystems affected by this patch series: procfs, misc, core-kernel, lib, checkpatch, init, pipe, minix, fat, cgroups, kexec, kdump, taskstats, panic, kcov, resource, and ubsan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits) Revert "ubsan, kcsan: Don't combine sanitizer with kcov on clang" kernel/resource: fix kfree() of bootmem memory again kcov: properly handle subsequent mmap calls kcov: split ioctl handling into locked and unlocked parts panic: move panic_print before kmsg dumpers panic: add option to dump all CPUs backtraces in panic_print docs: sysctl/kernel: add missing bit to panic_print taskstats: remove unneeded dead assignment kasan: no need to unset panic_on_warn in end_report() ubsan: no need to unset panic_on_warn in ubsan_epilogue() panic: unset panic_on_warn inside panic() docs: kdump: add scp example to write out the dump file docs: kdump: update description about sysfs file system support arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef x86/setup: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef riscv: mm: init: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef kexec: make crashk_res, crashk_low_res and crash_notes symbols always visible cgroup: use irqsave in cgroup_rstat_flush_locked(). fat: use pointer to simple type in put_user() minix: fix bug when opening a file with O_DIRECT ...
2022-03-23cgroup: use irqsave in cgroup_rstat_flush_locked().Sebastian Andrzej Siewior
All callers of cgroup_rstat_flush_locked() acquire cgroup_rstat_lock either with spin_lock_irq() or spin_lock_irqsave(). cgroup_rstat_flush_locked() itself acquires cgroup_rstat_cpu_lock which is a raw_spin_lock. This lock is also acquired in cgroup_rstat_updated() in IRQ context and therefore requires _irqsave() locking suffix in cgroup_rstat_flush_locked(). Since there is no difference between spin_lock_t and raw_spin_lock_t on !RT lockdep does not complain here. On RT lockdep complains because the interrupts were not disabled here and a deadlock is possible. Acquire the raw_spin_lock_t with disabled interrupts. Link: https://lkml.kernel.org/r/20220301122143.1521823-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zefan Li <lizefan.x@bytedance.com> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Subject: cgroup: add a comment to cgroup_rstat_flush_locked(). Add a comment why spin_lock_irq() -> raw_spin_lock_irqsave() is needed. Link: https://lkml.kernel.org/r/Yh+DOK73hfVV5ThX@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23Merge branch 'for-5.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "All trivial cleanups without meaningful behavior changes" * 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: cleanup comments cgroup: Fix cgroup_can_fork() and cgroup_post_fork() kernel-doc comment cgroup: rstat: retrieve current bstat to delta directly cgroup: rstat: use same convention to assign cgroup_base_stat
2022-03-15Merge tag 'v5.17-rc8' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-13cgroup: cleanup commentsTom Rix
for spdx, add a space before // replacements judgement to judgment transofrmed to transformed partitition to partition histrical to historical migratecd to migrated Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-28Merge 5.17-rc6 into driver-core-nextGreg Kroah-Hartman
We need the driver core fix in here as well for future changes. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23kernfs: move struct kernfs_root out of the public view.Greg Kroah-Hartman
There is no need to have struct kernfs_root be part of kernfs.h for the whole kernel to see and poke around it. Move it internal to kernfs code and provide a helper function, kernfs_root_to_node(), to handle the one field that kernfs users were directly accessing from the structure. Cc: Imran Khan <imran.f.khan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220222070713.3517679-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-22Merge branch 'for-5.17-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix for a subtle bug in the recent release_agent permission check update - Fix for a long-standing race condition between cpuset and cpu hotplug - Comment updates * 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: Fix kernel-doc cgroup-v1: Correct privileges check in release_agent writes cgroup: clarify cgroup_css_set_fork() cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug
2022-02-22cpuset: Fix kernel-docJiapeng Chong
Fix the following W=1 kernel warnings: kernel/cgroup/cpuset.c:3718: warning: expecting prototype for cpuset_memory_pressure_bump(). Prototype was for __cpuset_memory_pressure_bump() instead. kernel/cgroup/cpuset.c:3568: warning: expecting prototype for cpuset_node_allowed(). Prototype was for __cpuset_node_allowed() instead. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-22cgroup-v1: Correct privileges check in release_agent writesMichal Koutný
The idea is to check: a) the owning user_ns of cgroup_ns, b) capabilities in init_user_ns. The commit 24f600856418 ("cgroup-v1: Require capabilities to set release_agent") got this wrong in the write handler of release_agent since it checked user_ns of the opener (may be different from the owning user_ns of cgroup_ns). Secondly, to avoid possibly confused deputy, the capability of the opener must be checked. Fixes: 24f600856418 ("cgroup-v1: Require capabilities to set release_agent") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/stable/20220216121142.GB30035@blackbody.suse.cz/ Signed-off-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Masami Ichikawa(CIP) <masami.ichikawa@cybertrust.co.jp> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-22cgroup: clarify cgroup_css_set_fork()Christian Brauner
With recent fixes for the permission checking when moving a task into a cgroup using a file descriptor to a cgroup's cgroup.procs file and calling write() it seems a good idea to clarify CLONE_INTO_CGROUP permission checking with a comment. Cc: Tejun Heo <tj@kernel.org> Cc: <cgroups@vger.kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-21Merge tag 'v5.17-rc5' into sched/core, to resolve conflictsIngo Molnar
New conflicts in sched/core due to the following upstream fixes: 44585f7bc0cb ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n") a06247c6804f ("psi: Fix uaf issue when psi trigger is destroyed while being polled") Conflicts: include/linux/psi_types.h kernel/sched/psi.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-02-16sched/isolation: Use single feature type while referring to housekeeping cpumaskFrederic Weisbecker
Refer to housekeeping APIs using single feature types instead of flags. This prevents from passing multiple isolation features at once to housekeeping interfaces, which soon won't be possible anymore as each isolation features will have their own cpumask. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
2022-02-14cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplugZhang Qiao
As previously discussed(https://lkml.org/lkml/2022/1/20/51), cpuset_attach() is affected with similar cpu hotplug race, as follow scenario: cpuset_attach() cpu hotplug --------------------------- ---------------------- down_write(cpuset_rwsem) guarantee_online_cpus() // (load cpus_attach) sched_cpu_deactivate set_cpu_active() // will change cpu_active_mask set_cpus_allowed_ptr(cpus_attach) __set_cpus_allowed_ptr_locked() // (if the intersection of cpus_attach and cpu_active_mask is empty, will return -EINVAL) up_write(cpuset_rwsem) To avoid races such as described above, protect cpuset_attach() call with cpu_hotplug_lock. Fixes: be367d099270 ("cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time") Cc: stable@vger.kernel.org # v2.6.32+ Reported-by: Zhao Gongyi <zhaogongyi@huawei.com> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-03Merge branch 'for-5.17-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Eric's fix for a long standing cgroup1 permission issue where it only checks for uid 0 instead of CAP which inadvertently allows unprivileged userns roots to modify release_agent userhelper - Fixes for the fallout from Waiman's recent cpuset work * 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning cgroup-v1: Require capabilities to set release_agent cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask() cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy
2022-02-03cgroup/cpuset: Fix "suspicious RCU usage" lockdep warningWaiman Long
It was found that a "suspicious RCU usage" lockdep warning was issued with the rcu_read_lock() call in update_sibling_cpumasks(). It is because the update_cpumasks_hier() function may sleep. So we have to release the RCU lock, call update_cpumasks_hier() and reacquire it afterward. Also add a percpu_rwsem_assert_held() in update_sibling_cpumasks() instead of stating that in the comment. Fixes: 4716909cc5c5 ("cpuset: Track cpusets that use parent's effective_cpus") Signed-off-by: Waiman Long <longman@redhat.com> Tested-by: Phil Auld <pauld@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-01cgroup-v1: Require capabilities to set release_agentEric W. Biederman
The cgroup release_agent is called with call_usermodehelper. The function call_usermodehelper starts the release_agent with a full set fo capabilities. Therefore require capabilities when setting the release_agaent. Reported-by: Tabitha Sable <tabitha.c.sable@gmail.com> Tested-by: Tabitha Sable <tabitha.c.sable@gmail.com> Fixes: 81a6a5cdd2c5 ("Task Control Groups: automatic userspace notification of idle cgroups") Cc: stable@vger.kernel.org # v2.6.24+ Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-26cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()Tianchen Ding
subparts_cpus should be limited as a subset of cpus_allowed, but it is updated wrongly by using cpumask_andnot(). Use cpumask_and() instead to fix it. Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag") Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-18psi: Fix uaf issue when psi trigger is destroyed while being polledSuren Baghdasaryan
With write operation on psi files replacing old trigger with a new one, the lifetime of its waitqueue is totally arbitrary. Overwriting an existing trigger causes its waitqueue to be freed and pending poll() will stumble on trigger->event_wait which was destroyed. Fix this by disallowing to redefine an existing psi trigger. If a write operation is used on a file descriptor with an already existing psi trigger, the operation will fail with EBUSY error. Also bypass a check for psi_disabled in the psi_trigger_destroy as the flag can be flipped after the trigger is created, leading to a memory leak. Fixes: 0e94682b73bf ("psi: introduce psi monitor") Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Analyzed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220111232309.1786347-1-surenb@google.com
2022-01-12cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchyMichal Koutný
The commit 1f1562fcd04a ("cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy") inteded to relax the check only on the default hierarchy (or v2 mode) but it dropped the check in v1 too. This patch returns and separates the legacy-only validations so that they can be considered only in the v1 mode, which should enforce the old constraints for the sake of compatibility. Fixes: 1f1562fcd04a ("cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy") Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-12cgroup: Fix cgroup_can_fork() and cgroup_post_fork() kernel-doc commentYang Li
Add the description of @kargs in cgroup_can_fork() and cgroup_post_fork() kernel-doc comment to remove warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. kernel/cgroup/cgroup.c:6235: warning: Function parameter or member 'kargs' not described in 'cgroup_can_fork' kernel/cgroup/cgroup.c:6296: warning: Function parameter or member 'kargs' not described in 'cgroup_post_fork' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-12cgroup: rstat: retrieve current bstat to delta directlyWei Yang
Instead of retrieve current bstat to cur and copy it to delta, let's use delta directly. This saves one copy operation and has the same code convention as propagating delta to parent. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-12cgroup: rstat: use same convention to assign cgroup_base_statWei Yang
In function cgroup_base_stat_flush(), we update cgroup_base_stat by getting rstatc->bstat and adjust delta to related fields. There are two convention to assign cgroup_base_stat in this function: * rstat2 = rstat1 * rstat2.cputime = rstat1.cputime The second convention may make audience think just field "cputime" is updated, while cputime is the only field in cgroup_base_stat. Let's use the same convention to eliminate this confusion. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-11Merge branch 'for-5.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Nothing too interesting. The only two noticeable changes are a subtle cpuset behavior fix and trace event id field being expanded to u64 from int. Most others are code cleanups" * 'for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: convert 'allowed' in __cpuset_node_allowed() to be boolean cgroup/rstat: check updated_next only for root cgroup: rstat: explicitly put loop variant in while cgroup: return early if it is already on preloaded list cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy cgroup: Trace event cgroup id fields should be u64 cgroup: fix a typo in comment cgroup: get the wrong css for css_alloc() during cgroup_init_subsys() cgroup: rstat: Mark benign data race to silence KCSAN
2022-01-10Merge tag '5.17-net-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core ---- - Defer freeing TCP skbs to the BH handler, whenever possible, or at least perform the freeing outside of the socket lock section to decrease cross-CPU allocator work and improve latency. - Add netdevice refcount tracking to locate sources of netdevice and net namespace refcount leaks. - Make Tx watchdog less intrusive - avoid pausing Tx and restarting all queues from a single CPU removing latency spikes. - Various small optimizations throughout the stack from Eric Dumazet. - Make netdev->dev_addr[] constant, force modifications to go via appropriate helpers to allow us to keep addresses in ordered data structures. - Replace unix_table_lock with per-hash locks, improving performance of bind() calls. - Extend skb drop tracepoint with a drop reason. - Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW. BPF --- - New helpers: - bpf_find_vma(), find and inspect VMAs for profiling use cases - bpf_loop(), runtime-bounded loop helper trading some execution time for much faster (if at all converging) verification - bpf_strncmp(), improve performance, avoid compiler flakiness - bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt() for tracing programs, all inlined by the verifier - Support BPF relocations (CO-RE) in the kernel loader. - Further the support for BTF_TYPE_TAG annotations. - Allow access to local storage in sleepable helpers. - Convert verifier argument types to a composable form with different attributes which can be shared across types (ro, maybe-null). - Prepare libbpf for upcoming v1.0 release by cleaning up APIs, creating new, extensible ones where missing and deprecating those to be removed. Protocols --------- - WiFi (mac80211/cfg80211): - notify user space about long "come back in N" AP responses, allow it to react to such temporary rejections - allow non-standard VHT MCS 10/11 rates - use coarse time in airtime fairness code to save CPU cycles - Bluetooth: - rework of HCI command execution serialization to use a common queue and work struct, and improve handling errors reported in the middle of a batch of commands - rework HCI event handling to use skb_pull_data, avoiding packet parsing pitfalls - support AOSP Bluetooth Quality Report - SMC: - support net namespaces, following the RDMA model - improve connection establishment latency by pre-clearing buffers - introduce TCP ULP for automatic redirection to SMC - Multi-Path TCP: - support ioctls: SIOCINQ, OUTQ, and OUTQNSD - support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT, IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY - support cmsgs: TCP_INQ - improvements in the data scheduler (assigning data to subflows) - support fastclose option (quick shutdown of the full MPTCP connection, similar to TCP RST in regular TCP) - MCTP (Management Component Transport) over serial, as defined by DMTF spec DSP0253 - "MCTP Serial Transport Binding". Driver API ---------- - Support timestamping on bond interfaces in active/passive mode. - Introduce generic phylink link mode validation for drivers which don't have any quirks and where MAC capability bits fully express what's supported. Allow PCS layer to participate in the validation. Convert a number of drivers. - Add support to set/get size of buffers on the Rx rings and size of the tx copybreak buffer via ethtool. - Support offloading TC actions as first-class citizens rather than only as attributes of filters, improve sharing and device resource utilization. - WiFi (mac80211/cfg80211): - support forwarding offload (ndo_fill_forward_path) - support for background radar detection hardware - SA Query Procedures offload on the AP side New hardware / drivers ---------------------- - tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with real-time requirements for isochronous communication with protocols like OPC UA Pub/Sub. - Qualcomm BAM-DMUX WWAN - driver for data channels of modems integrated into many older Qualcomm SoCs, e.g. MSM8916 or MSM8974 (qcom_bam_dmux). - Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch driver with support for bridging, VLANs and multicast forwarding (lan966x). - iwlmei driver for co-operating between Intel's WiFi driver and Intel's Active Management Technology (AMT) devices. - mse102x - Vertexcom MSE102x Homeplug GreenPHY chips - Bluetooth: - MediaTek MT7921 SDIO devices - Foxconn MT7922A - Realtek RTL8852AE Drivers ------- - Significantly improve performance in the datapaths of: lan78xx, ax88179_178a, lantiq_xrx200, bnxt. - Intel Ethernet NICs: - igb: support PTP/time PEROUT and EXTTS SDP functions on 82580/i354/i350 adapters - ixgbevf: new PF -> VF mailbox API which avoids the risk of mailbox corruption with ESXi - iavf: support configuration of VLAN features of finer granularity, stacked tags and filtering - ice: PTP support for new E822 devices with sub-ns precision - ice: support firmware activation without reboot - Mellanox Ethernet NICs (mlx5): - expose control over IRQ coalescing mode (CQE vs EQE) via ethtool - support TC forwarding when tunnel encap and decap happen between two ports of the same NIC - dynamically size and allow disabling various features to save resources for running in embedded / SmartNIC scenarios - Broadcom Ethernet NICs (bnxt): - use page frag allocator to improve Rx performance - expose control over IRQ coalescing mode (CQE vs EQE) via ethtool - Other Ethernet NICs: - amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support - Microsoft cloud/virtual NIC (mana): - add XDP support (PASS, DROP, TX) - Mellanox Ethernet switches (mlxsw): - initial support for Spectrum-4 ASICs - VxLAN with IPv6 underlay - Marvell Ethernet switches (prestera): - support flower flow templates - add basic IP forwarding support - NXP embedded Ethernet switches (ocelot & felix): - support Per-Stream Filtering and Policing (PSFP) - enable cut-through forwarding between ports by default - support FDMA to improve packet Rx/Tx to CPU - Other embedded switches: - hellcreek: improve trapping management (STP and PTP) packets - qca8k: support link aggregation and port mirroring - Qualcomm 802.11ax WiFi (ath11k): - qca6390, wcn6855: enable 802.11 power save mode in station mode - BSS color change support - WCN6855 hw2.1 support - 11d scan offload support - scan MAC address randomization support - full monitor mode, only supported on QCN9074 - qca6390/wcn6855: report signal and tx bitrate - qca6390: rfkill support - qca6390/wcn6855: regdb.bin support - Intel WiFi (iwlwifi): - support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS) in cooperation with the BIOS - support for Optimized Connectivity Experience (OCE) scan - support firmware API version 68 - lots of preparatory work for the upcoming Bz device family - MediaTek WiFi (mt76): - Specific Absorption Rate (SAR) support - mt7921: 160 MHz channel support - RealTek WiFi (rtw88): - Specific Absorption Rate (SAR) support - scan offload - Other WiFi NICs - ath10k: support fetching (pre-)calibration data from nvmem - brcmfmac: configure keep-alive packet on suspend - wcn36xx: beacon filter support" * tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2048 commits) tcp: tcp_send_challenge_ack delete useless param `skb` net/qla3xxx: Remove useless DMA-32 fallback configuration rocker: Remove useless DMA-32 fallback configuration hinic: Remove useless DMA-32 fallback configuration lan743x: Remove useless DMA-32 fallback configuration net: enetc: Remove useless DMA-32 fallback configuration cxgb4vf: Remove useless DMA-32 fallback configuration cxgb4: Remove useless DMA-32 fallback configuration cxgb3: Remove useless DMA-32 fallback configuration bnx2x: Remove useless DMA-32 fallback configuration et131x: Remove useless DMA-32 fallback configuration be2net: Remove useless DMA-32 fallback configuration vmxnet3: Remove useless DMA-32 fallback configuration bna: Simplify DMA setting net: alteon: Simplify DMA setting myri10ge: Simplify DMA setting qlcnic: Simplify DMA setting net: allwinner: Fix print format page_pool: remove spinlock in page_pool_refill_alloc_cache() amt: fix wrong return type of amt_send_membership_update() ...
2022-01-07cpuset: convert 'allowed' in __cpuset_node_allowed() to be booleanQi Zheng
Convert 'allowed' in __cpuset_node_allowed() to be boolean since the return types of node_isset() and __cpuset_node_allowed() are both boolean. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-06cgroup/rstat: check updated_next only for rootWei Yang
After commit dc26532aed0a ("cgroup: rstat: punt root-level optimization to individual controllers"), each rstat on updated_children list has its ->updated_next not NULL. This means we can remove the check on ->updated_next, if we make sure the subtree from @root is on list, which could be done by checking updated_next for root. tj: Coding style fixes. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-06cgroup: rstat: explicitly put loop variant in whileWei Yang
Instead of do while unconditionally, let's put the loop variant in while. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-06cgroup: Use open-time cgroup namespace for process migration perm checksTejun Heo
cgroup process migration permission checks are performed at write time as whether a given operation is allowed or not is dependent on the content of the write - the PID. This currently uses current's cgroup namespace which is a potential security weakness as it may allow scenarios where a less privileged process tricks a more privileged one into writing into a fd that it created. This patch makes cgroup remember the cgroup namespace at the time of open and uses it for migration permission checks instad of current's. Note that this only applies to cgroup2 as cgroup1 doesn't have namespace support. This also fixes a use-after-free bug on cgroupns reported in https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com Note that backporting this fix also requires the preceding patch. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reported-by: syzbot+50f5cf33a284ce738b62@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option") Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-06cgroup: Allocate cgroup_file_ctx for kernfs_open_file->privTejun Heo
of->priv is currently used by each interface file implementation to store private information. This patch collects the current two private data usages into struct cgroup_file_ctx which is allocated and freed by the common path. This allows generic private data which applies to multiple files, which will be used to in the following patch. Note that cgroup_procs iterator is now embedded as procs.iter in the new cgroup_file_ctx so that it doesn't need to be allocated and freed separately. v2: union dropped from cgroup_file_ctx and the procs iterator is embedded in cgroup_file_ctx as suggested by Linus. v3: Michal pointed out that cgroup1's procs pidlist uses of->priv too. Converted. Didn't change to embedded allocation as cgroup1 pidlists get stored for caching. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Michal Koutný <mkoutny@suse.com>
2022-01-06cgroup: Use open-time credentials for process migraton perm checksTejun Heo
cgroup process migration permission checks are performed at write time as whether a given operation is allowed or not is dependent on the content of the write - the PID. This currently uses current's credentials which is a potential security weakness as it may allow scenarios where a less privileged process tricks a more privileged one into writing into a fd that it created. This patch makes both cgroup2 and cgroup1 process migration interfaces to use the credentials saved at the time of open (file->f_cred) instead of current's. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Fixes: 187fe84067bd ("cgroup: require write perm on common ancestor when moving processes on the default hierarchy") Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-12-16add missing bpf-cgroup.h includesJakub Kicinski
We're about to break the cgroup-defs.h -> bpf-cgroup.h dependency, make sure those who actually need more than the definition of struct cgroup_bpf include bpf-cgroup.h explicitly. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20211216025538.1649516-3-kuba@kernel.org
2021-12-14cgroup: return early if it is already on preloaded listWei Yang
If a cset is already on preloaded list, this means we have already setup this cset properly for migration. This patch just relocates the root cgrp lookup which isn't used anyway when the cset is already on the preloaded list. [tj@kernel.org: rephrase the commit log] Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-12-13cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchyWaiman Long
In validate_change(), there is a check since v2.6.12 to make sure that each of the child cpusets must be a subset of a parent cpuset. IOW, it allows child cpusets to restrict what changes can be made to a parent's "cpuset.cpus". This actually violates one of the core principles of the default hierarchy where a cgroup higher up in the hierarchy should be able to change configuration however it sees fit as deligation breaks down otherwise. To address this issue, the check is now removed for the default hierarchy to free parent cpusets from being restricted by child cpusets. The check will still apply for legacy hierarchy. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-11-29cgroup: get the wrong css for css_alloc() during cgroup_init_subsys()Wei Yang
css_alloc() needs the parent css, while cgroup_css() gets current cgropu's css. So we are getting the wrong css during cgroup_init_subsys(). Fortunately, cgrp_dfl_root.cgrp's css is not set yet, so the value we pass to css_alloc() is NULL anyway. Let's pass NULL directly during init, since we know there is no parent yet. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-11-15cgroup: rstat: Mark benign data race to silence KCSANMichal Koutný
There is a race between updaters and flushers (flush can possibly miss the latest update(s)). This is expected as explained in cgroup_rstat_updated() comment, add also machine readable annotation so that KCSAN results aren't noisy. Reported-by: Hao Sun <sunhao.th@gmail.com> Link: https://lore.kernel.org/r/CACkBjsbPVdkub=e-E-p1WBOLxS515ith-53SFdmFHWV_QMo40w@mail.gmail.com Suggested-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-11-06Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: "257 patches. Subsystems affected by this patch series: scripts, ocfs2, vfs, and mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache, gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools, memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm, vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram, cleanups, kfence, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits) mm/damon: remove return value from before_terminate callback mm/damon: fix a few spelling mistakes in comments and a pr_debug message mm/damon: simplify stop mechanism Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Docs/admin-guide/mm/damon/start: simplify the content Docs/admin-guide/mm/damon/start: fix a wrong link Docs/admin-guide/mm/damon/start: fix wrong example commands mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on mm/damon: remove unnecessary variable initialization Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) selftests/damon: support watermarks mm/damon/dbgfs: support watermarks mm/damon/schemes: activate schemes based on a watermarks mechanism tools/selftests/damon: update for regions prioritization of schemes mm/damon/dbgfs: support prioritization weights mm/damon/vaddr,paddr: support pageout prioritization mm/damon/schemes: prioritize regions within the quotas mm/damon/selftests: support schemes quotas mm/damon/dbgfs: support quotas of schemes ...
2021-11-06mm/page_alloc: detect allocation forbidden by cpuset and bail out earlyFeng Tang
There was a report that starting an Ubuntu in docker while using cpuset to bind it to movable nodes (a node only has movable zone, like a node for hotplug or a Persistent Memory node in normal usage) will fail due to memory allocation failure, and then OOM is involved and many other innocent processes got killed. It can be reproduced with command: $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status" (where node 4 is a movable node) runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0 CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased) Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020 Call Trace: dump_stack+0x6b/0x88 dump_header+0x4a/0x1e2 oom_kill_process.cold+0xb/0x10 out_of_memory.part.0+0xaf/0x230 out_of_memory+0x3d/0x80 __alloc_pages_slowpath.constprop.0+0x954/0xa20 __alloc_pages_nodemask+0x2d3/0x300 pipe_write+0x322/0x590 new_sync_write+0x196/0x1b0 vfs_write+0x1c3/0x1f0 ksys_write+0xa7/0xe0 do_syscall_64+0x52/0xd0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Mem-Info: active_anon:392832 inactive_anon:182 isolated_anon:0 active_file:68130 inactive_file:151527 isolated_file:0 unevictable:2701 dirty:0 writeback:7 slab_reclaimable:51418 slab_unreclaimable:116300 mapped:45825 shmem:735 pagetables:2540 bounce:0 free:159849484 free_pcp:73 free_cma:0 Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 0 Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0 Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0 oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB The reason is that in this case, the target cpuset nodes only have movable zone, while the creation of an OS in docker sometimes needs to allocate memory in non-movable zones (dma/dma32/normal) like GFP_HIGHUSER, and the cpuset limit forbids the allocation, then out-of-memory killing is involved even when normal nodes and movable nodes both have many free memory. The OOM killer cannot help to resolve the situation as there is no usable memory for the request in the cpuset scope. The only reasonable measure to take is to fail the allocation right away and have the caller to deal with it. So add a check for cases like this in the slowpath of allocation, and bail out early returning NULL for the allocation. As page allocation is one of the hottest path in kernel, this check will hurt all users with sane cpuset configuration, add a static branch check and detect the abnormal config in cpuset memory binding setup so that the extra check cost in page allocation is not paid by everyone. [thanks to Micho Hocko and David Rientjes for suggesting not handling it inside OOM code, adding cpuset check, refining comments] Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-02Merge branch 'for-5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - The misc controller now reports allocation rejections through misc.events instead of printking - cgroup_mutex usage is reduced to improve scalability of some operations - vhost helper threads are now assigned to the right cgroup on cgroup2 - Bug fixes * 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: bpf: Move wrapper for __cgroup_bpf_*() to kernel/bpf/cgroup.c cgroup: Fix rootcg cpu.stat guest double counting cgroup: no need for cgroup_mutex for /proc/cgroups cgroup: remove cgroup_mutex from cgroupstats_build cgroup: reduce dependency on cgroup_mutex cgroup: cgroup-v1: do not exclude cgrp_dfl_root cgroup: Make rebind_subsystems() disable v2 controllers all at once docs/cgroup: add entry for misc.events misc_cgroup: remove error log to avoid log flood misc_cgroup: introduce misc.events to count failures
2021-11-01cgroup: bpf: Move wrapper for __cgroup_bpf_*() to kernel/bpf/cgroup.cHe Fengqing
In commit 324bda9e6c5a("bpf: multi program support for cgroup+bpf") cgroup_bpf_*() called from kernel/bpf/syscall.c, but now they are only used in kernel/bpf/cgroup.c, so move these function to kernel/bpf/cgroup.c, like cgroup_bpf_replace(). Signed-off-by: He Fengqing <hefengqing@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-11-01cgroup: Fix rootcg cpu.stat guest double countingDan Schatzberg
In account_guest_time in kernel/sched/cputime.c guest time is attributed to both CPUTIME_NICE and CPUTIME_USER in addition to CPUTIME_GUEST_NICE and CPUTIME_GUEST respectively. Therefore, adding both to calculate usage results in double counting any guest time at the rootcg. Fixes: 936f2a70f207 ("cgroup: add cpu.stat file to root cgroup") Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-10-25cgroup: no need for cgroup_mutex for /proc/cgroupsShakeel Butt
On the real systems, the cgroups hierarchies are setup early and just once by the node controller, so, other than number of cgroups, all information in /proc/cgroups remain same for the system uptime. Let's remove the cgroup_mutex usage on reading /proc/cgroups. There is a chance of inconsistent number of cgroups for co-mounted cgroups while printing the information from /proc/cgroups but that is not a big issue. In addition /proc/cgroups is a v1 specific interface, so the dependency on it should reduce over time. The main motivation for removing the cgroup_mutex from /proc/cgroups is to reduce the avenues of its contention. On our fleet, we have observed buggy application hammering on /proc/cgroups and drastically slowing down the node controller on the system which have many negative consequences on other workloads running on the system. Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-10-25cgroup: remove cgroup_mutex from cgroupstats_buildShakeel Butt
The function cgroupstats_build extracts cgroup from the kernfs_node's priv pointer which is a RCU pointer. So, there is no need to grab cgroup_mutex. Just get the reference on the cgroup before using and remove the cgroup_mutex altogether. Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-10-25cgroup: reduce dependency on cgroup_mutexShakeel Butt
Currently cgroup_get_from_path() and cgroup_get_from_id() grab cgroup_mutex before traversing the default hierarchy to find the kernfs_node corresponding to the path/id and then extract the linked cgroup. Since cgroup_mutex is still held, it is guaranteed that the cgroup will be alive and the reference can be taken on it. However similar guarantee can be provided without depending on the cgroup_mutex and potentially reducing avenues of cgroup_mutex contentions. The kernfs_node's priv pointer is RCU protected pointer and with just rcu read lock we can grab the reference on the cgroup without cgroup_mutex. So, remove cgroup_mutex from them. Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-10-22cgroup: Fix memory leak caused by missing cgroup_bpf_offlineQuanyang Wang
When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running the command as below: $mount -t cgroup -o none,name=foo cgroup cgroup/ $umount cgroup/ unreferenced object 0xc3585c40 (size 64): comm "mount", pid 425, jiffies 4294959825 (age 31.990s) hex dump (first 32 bytes): 01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00 ......(......... 00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00 ........lC...... backtrace: [<e95a2f9e>] cgroup_bpf_inherit+0x44/0x24c [<1f03679c>] cgroup_setup_root+0x174/0x37c [<ed4b0ac5>] cgroup1_get_tree+0x2c0/0x4a0 [<f85b12fd>] vfs_get_tree+0x24/0x108 [<f55aec5c>] path_mount+0x384/0x988 [<e2d5e9cd>] do_mount+0x64/0x9c [<208c9cfe>] sys_mount+0xfc/0x1f4 [<06dd06e0>] ret_fast_syscall+0x0/0x48 [<a8308cb3>] 0xbeb4daa8 This is because that since the commit 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path") root_cgrp->bpf.refcnt.data is allocated by the function percpu_ref_init in cgroup_bpf_inherit which is called by cgroup_setup_root when mounting, but not freed along with root_cgrp when umounting. Adding cgroup_bpf_offline which calls percpu_ref_kill to cgroup_kill_sb can free root_cgrp->bpf.refcnt.data in umount path. This patch also fixes the commit 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"). A cgroup_bpf_offline is needed to do a cleanup that frees the resources which are allocated by cgroup_bpf_inherit in cgroup_setup_root. And inside cgroup_bpf_offline, cgroup_get() is at the beginning and cgroup_put is at the end of cgroup_bpf_release which is called by cgroup_bpf_offline. So cgroup_bpf_offline can keep the balance of cgroup's refcount. Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path") Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20211018075623.26884-1-quanyang.wang@windriver.com
2021-10-11Merge branch 'for-5.15-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "All documentation / comment updates" * 'for-5.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroupv2, docs: fix misinformation in "device controller" section cgroup/cpuset: Change references of cpuset_mutex to cpuset_rwsem docs/cgroup: remove some duplicate words
2021-10-05cgroup: cgroup-v1: do not exclude cgrp_dfl_rootVishal Verma
Found an issue within cgroup_attach_task_all() fn which seem to exclude cgrp_dfl_root (cgroupv2) while attaching tasks to the given cgroup. This was noticed when the system was running qemu/kvm with kernel vhost helper threads. It appears that the vhost layer which uses cgroup_attach_task_all() fn to assign the vhost kthread to the right qemu cgroup works fine with cgroupv1 based configuration but not in cgroupv2. With cgroupv2, the vhost helper thread ends up just belonging to the root cgroup as is shown below: $ stat -fc %T /sys/fs/cgroup/ cgroup2fs $ sudo pgrep qemu 1916421 $ ps -eL | grep 1916421 1916421 1916421 ? 00:00:01 qemu-system-x86 1916421 1916431 ? 00:00:00 call_rcu 1916421 1916435 ? 00:00:00 IO mon_iothread 1916421 1916436 ? 00:00:34 CPU 0/KVM 1916421 1916439 ? 00:00:00 SPICE Worker 1916421 1916440 ? 00:00:00 vnc_worker 1916433 1916433 ? 00:00:00 vhost-1916421 1916437 1916437 ? 00:00:00 kvm-pit/1916421 $ cat /proc/1916421/cgroup 0::/machine.slice/machine-qemu\x2d18\x2dDroplet\x2d7572850.scope/emulator $ cat /proc/1916439/cgroup 0::/machine.slice/machine-qemu\x2d18\x2dDroplet\x2d7572850.scope/emulator $ cat /proc/1916433/cgroup 0::/ From above, it can be seen that the vhost kthread (PID: 1916433) doesn't seem to belong the qemu cgroup like other qemu PIDs. After applying this patch: $ pgrep qemu 1643 $ ps -eL | grep 1643 1643 1643 ? 00:00:00 qemu-system-x86 1643 1645 ? 00:00:00 call_rcu 1643 1648 ? 00:00:00 IO mon_iothread 1643 1649 ? 00:00:00 CPU 0/KVM 1643 1652 ? 00:00:00 SPICE Worker 1643 1653 ? 00:00:00 vnc_worker 1647 1647 ? 00:00:00 vhost-1643 1651 1651 ? 00:00:00 kvm-pit/1643 $ cat /proc/1647/cgroup 0::/machine.slice/machine-qemu\x2d18\x2dDroplet\x2d7572850.scope/emulator Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-09-28bpf, cgroup: Assign cgroup in cgroup_sk_alloc when called from interruptDaniel Borkmann
If cgroup_sk_alloc() is called from interrupt context, then just assign the root cgroup to skcd->cgroup. Prior to commit 8520e224f547 ("bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode") we would just return, and later on in sock_cgroup_ptr(), we were NULL-testing the cgroup in fast-path, and iff indeed NULL returning the root cgroup (v ?: &cgrp_dfl_root.cgrp). Rather than re-adding the NULL-test to the fast-path we can just assign it once from cgroup_sk_alloc() given v1/v2 handling has been simplified. The migration from NULL test with returning &cgrp_dfl_root.cgrp to assigning &cgrp_dfl_root.cgrp directly does /not/ change behavior for callers of sock_cgroup_ptr(). syzkaller was able to trigger a splat in the legacy netrom code base, where the RX handler in nr_rx_frame() calls nr_make_new() which calls sk_alloc() and therefore cgroup_sk_alloc() with in_interrupt() condition. Thus the NULL skcd->cgroup, where it trips over on cgroup_sk_free() side given it expects a non-NULL object. There are a few other candidates aside from netrom which have similar pattern where in their accept-like implementation, they just call to sk_alloc() and thus cgroup_sk_alloc() instead of sk_clone_lock() with the corresponding cgroup_sk_clone() which then inherits the cgroup from the parent socket. None of them are related to core protocols where BPF cgroup programs are running from. However, in future, they should follow to implement a similar inheritance mechanism. Additionally, with a !CONFIG_CGROUP_NET_PRIO and !CONFIG_CGROUP_NET_CLASSID configuration, the same issue was exposed also prior to 8520e224f547 due to commit e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated cgroup") which added the early in_interrupt() return back then. Fixes: 8520e224f547 ("bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode") Fixes: e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated cgroup") Reported-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com Reported-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com Tested-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20210927123921.21535-1-daniel@iogearbox.net