path: root/kernel
2022-07-07 signal handling: don't use BUG_ON() for debugging (Linus Torvalds)
These are indeed "should not happen" situations, but it turns out recent changes made the 'task_is_stopped_or_traced()' case trigger (a fix for that exists and is pending more testing), and the BUG_ON() makes it unnecessarily hard to actually debug for no good reason.

It's been that way for a long time, but let's make it clear: BUG_ON() is not good for debugging, and should never be used in situations where you could just say "this shouldn't happen, but we can continue". Use WARN_ON_ONCE() instead to make sure it gets logged, and then just continue running, instead of making the system basically unusable because you crashed the machine while potentially holding some very core locks (e.g. this function is commonly called while holding 'tasklist_lock' for writing).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
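As an illustration of the pattern being argued for (a sketch, not the exact hunk from this commit):

    /* Before: crashes the machine, potentially while holding tasklist_lock */
    BUG_ON(task_is_stopped_or_traced(p));

    /* After: WARN_ON_ONCE() returns the condition, so log once and keep going */
    if (WARN_ON_ONCE(task_is_stopped_or_traced(p)))
        return;    /* or simply fall through, depending on the call path */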
2022-07-07 genirq: Return a const cpumask from irq_data_get_affinity_mask (Samuel Holland)
Now that the irq_data_update_affinity helper exists, enforce its use by returning a const cpumask from irq_data_get_affinity_mask. Since the previous commit already updated places that needed to call irq_data_update_affinity, this commit updates the remaining code that either did not modify the cpumask or immediately passed the modified mask to irq_set_affinity.

Signed-off-by: Samuel Holland <samuel@sholland.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220701200056.46555-8-samuel@sholland.org
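The resulting API split looks roughly like this (a sketch; the exact in-tree definitions may differ in detail):

    /* Readers now get a const view of the mask... */
    static inline const struct cpumask *
    irq_data_get_affinity_mask(struct irq_data *d)
    {
        return d->common->affinity;
    }

    /* ...and writers must go through the update helper. */
    static inline void irq_data_update_affinity(struct irq_data *d,
                                                const struct cpumask *m)
    {
        cpumask_copy(d->common->affinity, m);
    }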
2022-07-07 genirq: Drop redundant irq_init_effective_affinity (Samuel Holland)
It does exactly the same thing as irq_data_update_effective_affinity. Signed-off-by: Samuel Holland <samuel@sholland.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220701200056.46555-5-samuel@sholland.org
2022-07-07 genirq: GENERIC_IRQ_EFFECTIVE_AFF_MASK depends on SMP (Samuel Holland)
An IRQ's effective affinity can only be different from its configured affinity if there are multiple CPUs. Make it clear that this option is only meaningful when SMP is enabled. Most of the relevant code in irqdesc.c is already hidden behind CONFIG_SMP anyway. Signed-off-by: Samuel Holland <samuel@sholland.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220701200056.46555-4-samuel@sholland.org
2022-07-07 genirq: GENERIC_IRQ_IPI depends on SMP (Samuel Holland)
The generic IPI code depends on the IRQ affinity mask being allocated and initialized. This will not be the case if SMP is disabled. Fix up the remaining driver that selected GENERIC_IRQ_IPI in a non-SMP config. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Samuel Holland <samuel@sholland.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220701200056.46555-3-samuel@sholland.org
2022-07-07 genirq: Don't return error on missing optional irq_request_resources() (Antonio Borneo)
The irq_chip::irq_request_resources() callback is documented as optional in the declaration of struct irq_chip. If the parent irq_chip does not implement it, we should simply ignore it and return. Don't return an error if the function is missing.

Signed-off-by: Antonio Borneo <antonio.borneo@foss.st.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220512160544.13561-1-antonio.borneo@foss.st.com
2022-07-05 bpf, libbpf: Add type match support (Daniel Müller)
This patch adds support for the proposed type match relation to relo_core where it is shared between userspace and kernel. It plumbs through both kernel-side and libbpf-side support. The matching relation is defined as follows (copy from source):

- modifiers and typedefs are stripped (and, hence, effectively ignored)
- generally speaking types need to be of same kind (struct vs. struct, union vs. union, etc.)
- exceptions are struct/union behind a pointer which could also match a forward declaration of a struct or union, respectively, and enum vs. enum64 (see below)

Then, depending on type:
- integers:
  - match if size and signedness match
- arrays & pointers:
  - target types are recursively matched
- structs & unions:
  - local members need to exist in target with the same name
  - for each member we recursively check match unless it is already behind a pointer, in which case we only check matching names and compatible kind
- enums:
  - local variants have to have a match in target by symbolic name (but not numeric value)
  - size has to match (but enum may match enum64 and vice versa)
- function pointers:
  - number and position of arguments in local type has to match target
  - for each argument and the return value we recursively check match

Signed-off-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220628160127.607834-5-deso@posteo.net
2022-07-05 MAINTAINERS: Add Paul as context tracking maintainer (Frederic Weisbecker)
Since most of the bits have been imported from kernel/rcu/tree.c and now that the context tracking code is tightly linked to RCU, add Paul as a context tracking maintainer. Also update the context tracking file header accordingly. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-07-05 context_tracking: Convert state to atomic_t (Frederic Weisbecker)
Context tracking's state and dynticks counter are going to be merged in a single field so that both updates can happen atomically and at the same time. Prepare for that with converting the state into an atomic_t. [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
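A minimal sketch of what the conversion implies for readers of the state (simplified; not the full struct from the tree, and ct_state_sketch() is a hypothetical name):

    struct context_tracking {
        atomic_t state;    /* was a plain int / enum ctx_state */
    };

    /* Reads and updates now go through the atomic API, which is what
     * later allows the state to share a single word with the dynticks
     * counter and be updated atomically together with it. */
    static inline int ct_state_sketch(struct context_tracking *ct)
    {
        return atomic_read(&ct->state);
    }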
2022-07-05 rcu/context-tracking: Remove unused and/or unnecessary middle functions (Frederic Weisbecker)
Some eqs functions are now only used internally by context tracking, so their public declarations can be removed. Also middle functions such as rcu_user_*() and rcu_idle_*() which now directly call to rcu_eqs_enter() and rcu_eqs_exit() can be wiped out as well. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-07-05 rcu/context-tracking: Move RCU-dynticks internal functions to context_tracking (Frederic Weisbecker)
Move the core RCU eqs/dynticks functions to context tracking so that we can later merge all that code within context tracking. Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-07-05 rcu/context-tracking: Move deferred nocb resched to context tracking (Frederic Weisbecker)
To prepare for migrating the RCU eqs accounting code to context tracking, split the last-resort deferred nocb resched from rcu_user_enter() and move it into a separate call from context tracking. Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-07-05 rcu/context_tracking: Move dynticks_nmi_nesting to context tracking (Frederic Weisbecker)
The RCU eqs tracking is going to be performed by the context tracking subsystem. The related nesting counters thus need to be moved to the context tracking structure. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-07-05 rcu/context_tracking: Move dynticks_nesting to context tracking (Frederic Weisbecker)
The RCU eqs tracking is going to be performed by the context tracking subsystem. The related nesting counters thus need to be moved to the context tracking structure. Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-07-05 rcu/context_tracking: Move dynticks counter to context tracking (Frederic Weisbecker)
In order to prepare for merging RCU dynticks counter into the context tracking state, move the rcu_data's dynticks field to the context tracking structure. It will later be mixed within the context tracking state itself. [ paulmck: Move enum ctx_state into global scope. ] Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-07-05 rcu/context-tracking: Remove rcu_irq_enter/exit() (Frederic Weisbecker)
Now rcu_irq_enter/exit() are unnecessary middle calls between ct_irq_enter/exit() and nmi_irq_enter/exit(). Take this opportunity to remove the former functions and move the comments above them to the new entrypoints. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-07-05 context_tracking: Take NMI eqs entrypoints over RCU (Frederic Weisbecker)
The RCU dynticks counter is going to be merged into the context tracking subsystem. Prepare with moving the NMI extended quiescent states entrypoints to context tracking. For now those are dumb redirection to existing RCU calls. Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-07-05 context_tracking: Take IRQ eqs entrypoints over RCU (Frederic Weisbecker)
The RCU dynticks counter is going to be merged into the context tracking subsystem. Prepare with moving the IRQ extended quiescent states entrypoints to context tracking. For now those are dumb redirection to existing RCU calls. [ paulmck: Apply Stephen Rothwell feedback from -next. ] [ paulmck: Apply Nathan Chancellor feedback. ] Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-07-05 context_tracking: Take idle eqs entrypoints over RCU (Frederic Weisbecker)
The RCU dynticks counter is going to be merged into the context tracking subsystem. Start with moving the idle extended quiescent states entrypoints to context tracking. For now those are dumb redirections to existing RCU calls. [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-07-04 sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling (Uros Bizjak)
Use try_cmpxchg() instead of the "cmpxchg(ptr, old, new) != old" pattern in set_nr_{and_not,if}_polling(). x86 cmpxchg returns success in the ZF flag, so this change saves a compare after cmpxchg. The definition of the cmpxchg-based fetch_or() was changed in the same way as the atomic_fetch_##op definitions were changed in commit e6790e4b5d5e97dc287f3496dd2cf2dbabdfdb35.

Also declare these two functions as inline to ensure inlining: in the case of set_nr_and_not_polling(), the compiler (gcc) tries to outsmart itself by constructing the boolean return value with logic operations on the fetched value, and these extra operations enlarge the function beyond the inlining threshold.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220629151552.6015-1-ubizjak@gmail.com
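The shape of the transformation, as an illustrative sketch (not the exact hunk; _TIF_POLLING_NRFLAG used as an example flag):

    /* Before: classic cmpxchg loop, needs an extra compare on x86 */
    for (;;) {
        unsigned long old = READ_ONCE(*ptr);
        unsigned long new = old | _TIF_POLLING_NRFLAG;

        if (cmpxchg(ptr, old, new) == old)
            break;
    }

    /* After: try_cmpxchg() returns a bool (ZF on x86) and updates
     * 'old' with the current value on failure, so no re-read and no
     * extra compare are needed. */
    unsigned long old = READ_ONCE(*ptr), new;

    do {
        new = old | _TIF_POLLING_NRFLAG;
    } while (!try_cmpxchg(ptr, &old, new));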
2022-07-04 sched/core: add forced idle accounting for cgroups (Josh Don)
4feee7d1260 previously added per-task forced idle accounting. This patch extends this to also include cgroups. rstat is used for cgroup accounting, except for the root, which uses kcpustat in order to bypass the need for doing an rstat flush when reading root stats. Only cgroup v2 is supported. Similar to the task accounting, the cgroup accounting requires that schedstats is enabled. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lkml.kernel.org/r/20220629211426.3329954-1-joshdon@google.com
2022-07-03 mm: shrinkers: provide shrinkers with names (Roman Gushchin)
Currently shrinkers are anonymous objects. For debugging purposes they can be identified by count/scan function names, but that's not always useful: e.g. for a superblock's shrinker it's nice to have at least an idea of which superblock the shrinker belongs to.

This commit adds names to shrinkers. The register_shrinker() and prealloc_shrinker() functions are extended to take a format string and arguments to construct a name. In some cases it's not possible to determine a good name at the time a shrinker is allocated; for such cases shrinker_debugfs_rename() is provided.

The expected format is:

  <subsystem>-<shrinker_type>[:<instance>]-<id>

For some shrinkers an instance can be encoded as a (MAJOR:MINOR) pair. After this change the shrinker debugfs directory looks like:

  $ cd /sys/kernel/debug/shrinker/
  $ ls
  dquota-cache-16     sb-devpts-28     sb-proc-47       sb-tmpfs-42
  mm-shadow-18        sb-devtmpfs-5    sb-proc-48       sb-tmpfs-43
  mm-zspool:zram0-34  sb-hugetlbfs-17  sb-pstore-31     sb-tmpfs-44
  rcu-kfree-0         sb-hugetlbfs-33  sb-rootfs-2      sb-tmpfs-49
  sb-aio-20           sb-iomem-12      sb-securityfs-6  sb-tracefs-13
  sb-anon_inodefs-15  sb-mqueue-21     sb-selinuxfs-22  sb-xfs:vda1-36
  sb-bdev-3           sb-nsfs-4        sb-sockfs-8      sb-zsmalloc-19
  sb-bpf-32           sb-pipefs-14     sb-sysfs-26      thp-deferred_split-10
  sb-btrfs:vda2-24    sb-proc-25       sb-tmpfs-1       thp-zero-9
  sb-cgroup2-30       sb-proc-39       sb-tmpfs-27      xfs-buf:vda1-37
  sb-configfs-23      sb-proc-41       sb-tmpfs-29      xfs-inodegc:vda1-38
  sb-dax-11           sb-proc-45       sb-tmpfs-35
  sb-debugfs-7        sb-proc-46       sb-tmpfs-40

[roman.gushchin@linux.dev: fix build warnings]
Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
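Usage then looks roughly like this (a sketch under stated assumptions: the surrounding shrinker setup is elided, and the shrinker/name variables are illustrative):

    /* the name (plus an auto-appended id) appears under
     * /sys/kernel/debug/shrinker/ */
    err = register_shrinker(&my_shrinker, "mm-myobject");

    /* or, when preallocating; the instance part can be filled in later */
    err = prealloc_shrinker(&sb->s_shrink, "sb-%s", fs_type->name);

    /* rename once the final instance name is known */
    shrinker_debugfs_rename(&sb->s_shrink, "sb-%s:%s", fs_type->name, sb->s_id);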
2022-07-01 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf (Jakub Kicinski)
Daniel Borkmann says:

====================
pull-request: bpf 2022-07-02

We've added 7 non-merge commits during the last 14 day(s) which contain a total of 6 files changed, 193 insertions(+), 86 deletions(-).

The main changes are:

1) Fix clearing of page contiguity when unmapping XSK pool, from Ivan Malov.
2) Two verifier fixes around bounds data propagation, from Daniel Borkmann.
3) Fix fprobe sample module's parameter descriptions, from Masami Hiramatsu.
4) General BPF maintainer entry revamp to better scale patch reviews.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, selftests: Add verifier test case for jmp32's jeq/jne
  bpf, selftests: Add verifier test case for imm=0,umin=0,umax=1 scalar
  bpf: Fix insufficient bounds propagation from adjust_scalar_min_max_vals
  bpf: Fix incorrect verifier simulation around jmp32's jeq/jne
  xsk: Clear page contiguity bit when unmapping pool
  bpf, docs: Better scale maintenance of BPF subsystem
  fprobe, samples: Add module parameter descriptions
====================

Link: https://lore.kernel.org/r/20220701230121.10354-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-01 panic: Taint kernel if tests are run (David Gow)
Most in-kernel tests (such as KUnit tests) are not supposed to run on production systems: they may do deliberately illegal things to trigger errors, and have security implications (for example, KUnit assertions will often deliberately leak kernel addresses). Add a new taint type, TAINT_TEST to signal that a test has been run. This will be printed as 'N' (originally for kuNit, as every other sensible letter was taken.) This should discourage people from running these tests on production systems, and to make it easier to tell if tests have been run accidentally (by loading the wrong configuration, etc.) Acked-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
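For a test module or framework, setting the taint presumably reduces to a one-liner (illustrative sketch):

    /* mark the kernel as tainted by an in-kernel test; shows up as 'N' */
    add_taint(TAINT_TEST, LOCKDEP_STILL_OK);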
2022-07-01 module: Fix "warning: variable 'exit' set but not used" (Christophe Leroy)
When CONFIG_MODULE_UNLOAD is not selected, 'exit' is set but never used. It is not possible to replace the #ifdef CONFIG_MODULE_UNLOAD by IS_ENABLED(CONFIG_MODULE_UNLOAD) because mod->exit doesn't exist when CONFIG_MODULE_UNLOAD is not selected. And because of the rcu_read_lock_sched() section it is not easy to regroup everything in a single #ifdef. Let's regroup partially and add the missing #ifdef to completely opt out of using 'exit' when CONFIG_MODULE_UNLOAD is not selected. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-07-01 module: Fix selfAssignment cppcheck warning (Christophe Leroy)
cppcheck reports the following warnings:

  kernel/module/main.c:1455:26: warning: Redundant assignment of 'mod->core_layout.size' to itself. [selfAssignment]
    mod->core_layout.size = strict_align(mod->core_layout.size);
  kernel/module/main.c:1489:26: warning: Redundant assignment of 'mod->init_layout.size' to itself. [selfAssignment]
    mod->init_layout.size = strict_align(mod->init_layout.size);
  kernel/module/main.c:1493:26: warning: Redundant assignment of 'mod->init_layout.size' to itself. [selfAssignment]
    mod->init_layout.size = strict_align(mod->init_layout.size);
  kernel/module/main.c:1504:26: warning: Redundant assignment of 'mod->init_layout.size' to itself. [selfAssignment]
    mod->init_layout.size = strict_align(mod->init_layout.size);
  kernel/module/main.c:1459:26: warning: Redundant assignment of 'mod->data_layout.size' to itself. [selfAssignment]
    mod->data_layout.size = strict_align(mod->data_layout.size);
  kernel/module/main.c:1463:26: warning: Redundant assignment of 'mod->data_layout.size' to itself. [selfAssignment]
    mod->data_layout.size = strict_align(mod->data_layout.size);
  kernel/module/main.c:1467:26: warning: Redundant assignment of 'mod->data_layout.size' to itself. [selfAssignment]
    mod->data_layout.size = strict_align(mod->data_layout.size);

This is due to strict_align() being a no-op when CONFIG_STRICT_MODULE_RWX is not selected. Transform the strict_align() macro into an inline function. This allows type checking and avoids the selfAssignment warning.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
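The change described boils down to something like this for the !CONFIG_STRICT_MODULE_RWX case (a sketch, not the exact hunk):

    /* Before: a no-op macro, so "x = strict_align(x)" expands to the
     * literal self-assignment "x = x" that cppcheck flags. */
    #define strict_align(X) (X)

    /* After: an inline function keeps the no-op behaviour but adds
     * type checking and no longer looks like a self-assignment. */
    static inline unsigned int strict_align(unsigned int size)
    {
        return size;
    }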
2022-07-01 modules: Fix corruption of /proc/kallsyms (Adrian Hunter)
The commit 91fb02f31505 ("module: Move kallsyms support into a separate file") changed from using strlcpy() to using strscpy(), which created a buffer overflow. That happened because:

 1) an incorrect value was passed as the buffer length
 2) strscpy() (unlike strlcpy()) may copy beyond the length of the input string when copying word-by-word

The assumption was that because it was already known that the strings being copied would fit in the space available, it was not necessary to correctly set the buffer length. strscpy() breaks that assumption because although it will not touch bytes beyond the given buffer length, it may write bytes beyond the input string length when writing word-by-word.

The result of the buffer overflow is to corrupt the symbol type information that follows, e.g.:

  $ sudo cat -v /proc/kallsyms | grep '\^' | head
  ffffffffc0615000 ^@ rfcomm_session_get [rfcomm]
  ffffffffc061c060 ^@ session_list [rfcomm]
  ffffffffc06150d0 ^@ rfcomm_send_frame [rfcomm]
  ffffffffc0615130 ^@ rfcomm_make_uih [rfcomm]
  ffffffffc07ed58d ^@ bnep_exit [bnep]
  ffffffffc07ec000 ^@ bnep_rx_control [bnep]
  ffffffffc07ec1a0 ^@ bnep_session [bnep]
  ffffffffc07e7000 ^@ input_leds_event [input_leds]
  ffffffffc07e9000 ^@ input_leds_handler [input_leds]
  ffffffffc07e7010 ^@ input_leds_disconnect [input_leds]

Notably, the null bytes (represented above by ^@) can confuse tools. Fix by correcting the buffer length.

Fixes: 91fb02f31505 ("module: Move kallsyms support into a separate file")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
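The contract being violated, illustrated with hypothetical names ('src' and 'wrong_len' are placeholders, not the actual variables in the fix):

    char dst[KSYM_NAME_LEN];

    /* Bug: a "length" derived from something other than the destination.
     * strscpy() never writes past this bound, but its word-at-a-time
     * copy may write past the end of the *string*, clobbering whatever
     * follows within those bytes. */
    strscpy(dst, src, wrong_len);

    /* Fix: bound the copy by the real destination size. */
    strscpy(dst, src, sizeof(dst));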
2022-07-01 bpf: Fix insufficient bounds propagation from adjust_scalar_min_max_vals (Daniel Borkmann)
Kuee reported a corner case where the tnum becomes constant after the call to __reg_bound_offset(), but the register's bounds are not, that is, its min bounds are still not equal to the register's max bounds. This in turn allows to leak pointers by turning a pointer register as-is into an unknown scalar via adjust_ptr_min_max_vals().

Before:

  func#0 @0
  0: R1=ctx(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) R10=fp(off=0,imm=0,umax=0,var_off=(0x0; 0x0))
  0: (b7) r0 = 1          ; R0_w=scalar(imm=1,umin=1,umax=1,var_off=(0x1; 0x0))
  1: (b7) r3 = 0          ; R3_w=scalar(imm=0,umax=0,var_off=(0x0; 0x0))
  2: (87) r3 = -r3        ; R3_w=scalar()
  3: (87) r3 = -r3        ; R3_w=scalar()
  4: (47) r3 |= 32767     ; R3_w=scalar(smin=-9223372036854743041,umin=32767,var_off=(0x7fff; 0xffffffffffff8000),s32_min=-2147450881)
  5: (75) if r3 s>= 0x0 goto pc+1  ; R3_w=scalar(umin=9223372036854808575,var_off=(0x8000000000007fff; 0x7fffffffffff8000),s32_min=-2147450881,u32_min=32767)
  6: (95) exit
  from 5 to 7: R0=scalar(imm=1,umin=1,umax=1,var_off=(0x1; 0x0)) R1=ctx(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) R3=scalar(umin=32767,umax=9223372036854775807,var_off=(0x7fff; 0x7fffffffffff8000),s32_min=-2147450881) R10=fp(off=0,imm=0,umax=0,var_off=(0x0; 0x0))
  7: (d5) if r3 s<= 0x8000 goto pc+1  ; R3=scalar(umin=32769,umax=9223372036854775807,var_off=(0x7fff; 0x7fffffffffff8000),s32_min=-2147450881,u32_min=32767)
  8: (95) exit
  from 7 to 9: R0=scalar(imm=1,umin=1,umax=1,var_off=(0x1; 0x0)) R1=ctx(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) R3=scalar(umin=32767,umax=32768,var_off=(0x7fff; 0x8000)) R10=fp(off=0,imm=0,umax=0,var_off=(0x0; 0x0))
  9: (07) r3 += -32767    ; R3_w=scalar(imm=0,umax=1,var_off=(0x0; 0x0))  <--- [*]
  10: (95) exit

What can be seen here is that R3=scalar(umin=32767,umax=32768,var_off=(0x7fff; 0x8000)) after the operation r3 += -32767 results in a 'malformed' constant, that is, R3_w=scalar(imm=0,umax=1,var_off=(0x0; 0x0)). Intersecting with var_off has not been done at that point via __update_reg_bounds(), which would have improved the umax to be equal to the umin.

Refactor the tnum <> min/max bounds information flow into a reg_bounds_sync() helper and use it consistently everywhere. After the fix, the bounds have been corrected to R3_w=scalar(imm=0,umax=0,var_off=(0x0; 0x0)) and thus the register is regarded as a 'proper' constant scalar of 0.

After:

  func#0 @0
  0: R1=ctx(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) R10=fp(off=0,imm=0,umax=0,var_off=(0x0; 0x0))
  0: (b7) r0 = 1          ; R0_w=scalar(imm=1,umin=1,umax=1,var_off=(0x1; 0x0))
  1: (b7) r3 = 0          ; R3_w=scalar(imm=0,umax=0,var_off=(0x0; 0x0))
  2: (87) r3 = -r3        ; R3_w=scalar()
  3: (87) r3 = -r3        ; R3_w=scalar()
  4: (47) r3 |= 32767     ; R3_w=scalar(smin=-9223372036854743041,umin=32767,var_off=(0x7fff; 0xffffffffffff8000),s32_min=-2147450881)
  5: (75) if r3 s>= 0x0 goto pc+1  ; R3_w=scalar(umin=9223372036854808575,var_off=(0x8000000000007fff; 0x7fffffffffff8000),s32_min=-2147450881,u32_min=32767)
  6: (95) exit
  from 5 to 7: R0=scalar(imm=1,umin=1,umax=1,var_off=(0x1; 0x0)) R1=ctx(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) R3=scalar(umin=32767,umax=9223372036854775807,var_off=(0x7fff; 0x7fffffffffff8000),s32_min=-2147450881) R10=fp(off=0,imm=0,umax=0,var_off=(0x0; 0x0))
  7: (d5) if r3 s<= 0x8000 goto pc+1  ; R3=scalar(umin=32769,umax=9223372036854775807,var_off=(0x7fff; 0x7fffffffffff8000),s32_min=-2147450881,u32_min=32767)
  8: (95) exit
  from 7 to 9: R0=scalar(imm=1,umin=1,umax=1,var_off=(0x1; 0x0)) R1=ctx(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) R3=scalar(umin=32767,umax=32768,var_off=(0x7fff; 0x8000)) R10=fp(off=0,imm=0,umax=0,var_off=(0x0; 0x0))
  9: (07) r3 += -32767    ; R3_w=scalar(imm=0,umax=0,var_off=(0x0; 0x0))  <--- [*]
  10: (95) exit

Fixes: b03c9f9fdc37 ("bpf/verifier: track signed and unsigned min/max values")
Reported-by: Kuee K1r0a <liulin063@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220701124727.11153-2-daniel@iogearbox.net
2022-07-01 bpf: Fix incorrect verifier simulation around jmp32's jeq/jne (Daniel Borkmann)
Kuee reported a quirk in the jmp32's jeq/jne simulation, namely that the register value does not match expectations for the fall-through path. For example:

Before fix:

  0: R1=ctx(off=0,imm=0) R10=fp0
  0: (b7) r2 = 0          ; R2_w=P0
  1: (b7) r6 = 563        ; R6_w=P563
  2: (87) r2 = -r2        ; R2_w=Pscalar()
  3: (87) r2 = -r2        ; R2_w=Pscalar()
  4: (4c) w2 |= w6        ; R2_w=Pscalar(umin=563,umax=4294967295,var_off=(0x233; 0xfffffdcc),s32_min=-2147483085) R6_w=P563
  5: (56) if w2 != 0x8 goto pc+1  ; R2_w=P571  <--- [*]
  6: (95) exit
  R0 !read_ok

After fix:

  0: R1=ctx(off=0,imm=0) R10=fp0
  0: (b7) r2 = 0          ; R2_w=P0
  1: (b7) r6 = 563        ; R6_w=P563
  2: (87) r2 = -r2        ; R2_w=Pscalar()
  3: (87) r2 = -r2        ; R2_w=Pscalar()
  4: (4c) w2 |= w6        ; R2_w=Pscalar(umin=563,umax=4294967295,var_off=(0x233; 0xfffffdcc),s32_min=-2147483085) R6_w=P563
  5: (56) if w2 != 0x8 goto pc+1  ; R2_w=P8  <--- [*]
  6: (95) exit
  R0 !read_ok

As can be seen on line 5 for the branch fall-through path in R2 [*]: given the condition w2 != 0x8 is false, the verifier should conclude that r2 = 8, as the upper 32 bits are known to be zero. However, the verifier incorrectly concludes that r2 = 571, which is far off.

The problem is that it only marks false{true}_reg as known in the switch for the JE/NE case, but at the end of the function it uses {false,true}_{64,32}off to update {false,true}_reg->var_off, and those still hold the prior value of {false,true}_reg->var_off before it got marked as known. The subsequent __reg_combine_32_into_64() then propagates this old var_off and derives new bounds. The information between min/max bounds on {false,true}_reg from setting the register to a known constant, combined with the {false,true}_reg->var_off based on the old information, then derives wrong register data.

Fix it by detangling the BPF_JEQ/BPF_JNE cases and updating the relevant {false,true}_{64,32}off tnums along with the register marking to known constant.

Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: Kuee K1r0a <liulin063@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220701124727.11153-1-daniel@iogearbox.net
2022-07-01 remove CONFIG_ANDROID (Christoph Hellwig)
The ANDROID config symbol is only used to guard the binder config symbol and to inject completely random config changes. Remove it as it is obviously a bad idea. Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220629150102.1582425-2-hch@lst.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-01 pm/sleep: Add PM_USERSPACE_AUTOSLEEP Kconfig (Kalesh Singh)
Systems that initiate frequent suspend/resume from userspace can make the kernel aware by enabling the PM_USERSPACE_AUTOSLEEP config option. This allows certain sleep-sensitive code (wireguard/rng) to decide on what preparatory work should be performed (or not) in their pm_notification callbacks. This patch was prompted by the discussion at [1] which attempts to remove CONFIG_ANDROID that currently guards these code paths. [1] https://lore.kernel.org/r/20220629150102.1582425-1-hch@lst.de/ Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Link: https://lore.kernel.org/r/20220630191230.235306-1-kaleshsingh@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-30 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (Jakub Kicinski)
drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
  9c5de246c1db ("net: sparx5: mdb add/del handle non-sparx5 devices")
  fbb89d02e33a ("net: sparx5: Allow mdb entries to both CPU and ports")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-30 sysctl: add proc_dointvec_ms_jiffies_minmax (Yuwei Wang)
Add proc_dointvec_ms_jiffies_minmax() to convert a value read in milliseconds to jiffies while clamping it to a bounded range of values. Signed-off-by: Yuwei Wang <wangyuweihx@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
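A hypothetical ctl_table entry using the new handler might look like this (a sketch; all names and bounds below are made up for illustration):

    static int my_timeout;     /* kernel-side value stored in jiffies */
    static int my_min = 1;     /* bounds enforced via extra1/extra2 */
    static int my_max = 1000;

    static struct ctl_table my_table[] = {
        {
            .procname     = "my_timeout_ms",   /* read/written in msecs */
            .data         = &my_timeout,
            .maxlen       = sizeof(int),
            .mode         = 0644,
            .proc_handler = proc_dointvec_ms_jiffies_minmax,
            .extra1       = &my_min,
            .extra2       = &my_max,
        },
        {}
    };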
2022-06-29 context_tracking: Split user tracking Kconfig (Frederic Weisbecker)
Context tracking is going to be used not only to track user transitions but also idle/IRQs/NMIs. The user tracking part will then become a separate feature. Prepare Kconfig for that. [ frederic: Apply Max Filippov feedback. ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-06-29 context_tracking: Rename context_tracking_cpu_set() to ct_cpu_track_user() (Frederic Weisbecker)
context_tracking_cpu_set() is called in order to tell a CPU to track user/kernel transitions. Since context tracking is going to expand in to also track transitions from/to idle/IRQ/NMIs, the scope of this function name becomes too broad and needs to be made more specific. Also shorten the prefix to align with the new namespace. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-06-29 context_tracking: Rename context_tracking_enter/exit() to ct_user_enter/exit() (Frederic Weisbecker)
context_tracking_enter() and context_tracking_exit() have confusing names that don't explain the fact they are referring to user/guest state. Use more self-explanatory names and shrink to the new context tracking prefix instead. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-06-29 context_tracking: Rename context_tracking_user_enter/exit() to user_enter/exit_callable() (Frederic Weisbecker)
context_tracking_user_enter() and context_tracking_user_exit() are ASM callable versions of user_enter() and user_exit() for architectures that didn't manage to check the context tracking static key from ASM. Change those function names to better reflect their purpose. [ frederic: Apply Max Filippov feedback. ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-06-29 bpf: expose bpf_{g,s}etsockopt to lsm cgroup (Stanislav Fomichev)
I don't see how to make it nice without introducing btf id lists for the hooks where these helpers are allowed. Some LSM hooks work on the locked sockets, some are triggering early and don't grab any locks, so have two lists for now:

1. LSM hooks which trigger under socket lock - minority of the hooks, but ideal case for us, we can expose existing BTF-based helpers
2. LSM hooks which trigger without socket lock, but they trigger early in the socket creation path where it should be safe to do setsockopt without any locks
3. The rest are prohibited.

I'm thinking that this use-case might be a good gateway to sleeping lsm cgroup hooks in the future. We can either expose lock/unlock operations (and add tracking to the verifier) or have another set of bpf_setsockopt wrappers that grab the locks and might sleep.

Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-7-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29 bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP (Stanislav Fomichev)
We have two options:

1. Treat all BPF_LSM_CGROUP the same, regardless of attach_btf_id
2. Treat BPF_LSM_CGROUP+attach_btf_id as a separate hook point

I was doing (2) in the original patch, but switching to (1) here:

* bpf_prog_query returns all attached BPF_LSM_CGROUP programs regardless of attach_btf_id
* attach_btf_id is exported via bpf_prog_info

Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-6-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29 bpf: minimize number of allocated lsm slots per program (Stanislav Fomichev)
The previous patch adds a 1:1 mapping between all 211 LSM hooks and the bpf_cgroup program array. Instead of reserving a slot per possible hook, reserve 10 slots per cgroup for lsm programs. Those slots are dynamically allocated on demand and reclaimed.

  struct cgroup_bpf {
          struct bpf_prog_array *    effective[33];        /*     0   264 */
          /* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
          struct hlist_head          progs[33];            /*   264   264 */
          /* --- cacheline 8 boundary (512 bytes) was 16 bytes ago --- */
          u8                         flags[33];            /*   528    33 */

          /* XXX 7 bytes hole, try to pack */

          struct list_head           storages;             /*   568    16 */
          /* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
          struct bpf_prog_array *    inactive;             /*   584     8 */
          struct percpu_ref          refcnt;               /*   592    16 */
          struct work_struct         release_work;         /*   608    72 */

          /* size: 680, cachelines: 11, members: 7 */
          /* sum members: 673, holes: 1, sum holes: 7 */
          /* last cacheline: 40 bytes */
  };

Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-5-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29 bpf: per-cgroup lsm flavor (Stanislav Fomichev)
Allow attaching to lsm hooks in the cgroup context. Attaching to per-cgroup LSM works exactly like attaching to other per-cgroup hooks. New BPF_LSM_CGROUP is added to trigger new mode; the actual lsm hook we attach to is signaled via existing attach_btf_id. For the hooks that have 'struct socket' or 'struct sock' as its first argument, we use the cgroup associated with that socket. For the rest, we use 'current' cgroup (this is all on default hierarchy == v2 only). Note that for some hooks that work on 'struct sock' we still take the cgroup from 'current' because some of them work on the socket that hasn't been properly initialized yet. Behind the scenes, we allocate a shim program that is attached to the trampoline and runs cgroup effective BPF programs array. This shim has some rudimentary ref counting and can be shared between several programs attaching to the same lsm hook from different cgroups. Note that this patch bloats cgroup size because we add 211 cgroup_bpf_attach_type(s) for simplicity sake. This will be addressed in the subsequent patch. Also note that we only add non-sleepable flavor for now. To enable sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu, shim programs have to be freed via trace rcu, cgroup_bpf.effective should be also trace-rcu-managed + maybe some other changes that I'm not aware of. Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29 bpf: convert cgroup_bpf.progs to hlist (Stanislav Fomichev)
This lets us reclaim some space to be used by new cgroup lsm slots.

Before:

  struct cgroup_bpf {
          struct bpf_prog_array *    effective[23];        /*     0   184 */
          /* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
          struct list_head           progs[23];            /*   184   368 */
          /* --- cacheline 8 boundary (512 bytes) was 40 bytes ago --- */
          u32                        flags[23];            /*   552    92 */

          /* XXX 4 bytes hole, try to pack */

          /* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */
          struct list_head           storages;             /*   648    16 */
          struct bpf_prog_array *    inactive;             /*   664     8 */
          struct percpu_ref          refcnt;               /*   672    16 */
          struct work_struct         release_work;         /*   688    32 */

          /* size: 720, cachelines: 12, members: 7 */
          /* sum members: 716, holes: 1, sum holes: 4 */
          /* last cacheline: 16 bytes */
  };

After:

  struct cgroup_bpf {
          struct bpf_prog_array *    effective[23];        /*     0   184 */
          /* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
          struct hlist_head          progs[23];            /*   184   184 */
          /* --- cacheline 5 boundary (320 bytes) was 48 bytes ago --- */
          u8                         flags[23];            /*   368    23 */

          /* XXX 1 byte hole, try to pack */

          /* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
          struct list_head           storages;             /*   392    16 */
          struct bpf_prog_array *    inactive;             /*   408     8 */
          struct percpu_ref          refcnt;               /*   416    16 */
          struct work_struct         release_work;         /*   432    72 */

          /* size: 504, cachelines: 8, members: 7 */
          /* sum members: 503, holes: 1, sum holes: 1 */
          /* last cacheline: 56 bytes */
  };

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-3-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29 bpf: add bpf_func_t and trampoline helpers (Stanislav Fomichev)
I'll be adding lsm cgroup specific helpers that grab trampoline mutex. No functional changes. Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-2-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29 gfs2: Add glockfd debugfs file (Andreas Gruenbacher)
When a process has a gfs2 file open, the file is keeping a reference on the underlying gfs2 inode, and the inode is keeping the inode's iopen glock held in shared mode. In other words, the process depends on the iopen glock of each open gfs2 file. Expose those dependencies in a new "glockfd" debugfs file. The new debugfs file contains one line for each gfs2 file descriptor, specifying the tgid, file descriptor number, and glock name, e.g., 1601 6 5/816d This list is compiled by iterating all tasks on the system using find_ge_pid(), and all file descriptors of each task using task_lookup_next_fd_rcu(). To make that work from gfs2, export those two functions. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
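The iteration presumably reduces to something like this (a heavily simplified sketch built around the two newly exported helpers; is_gfs2_file(), glock_name and the seq_file 'm' are hypothetical placeholders, and error handling is elided):

    rcu_read_lock();
    for (pid = find_ge_pid(0, &init_pid_ns); pid;
         pid = find_ge_pid(pid_nr(pid) + 1, &init_pid_ns)) {
        struct task_struct *task = pid_task(pid, PIDTYPE_TGID);
        unsigned int fd = 0;
        struct file *file;

        if (!task)
            continue;
        /* walk this task's open file descriptors under RCU */
        for (; (file = task_lookup_next_fd_rcu(task, &fd)); fd++) {
            if (is_gfs2_file(file))
                seq_printf(m, "%d %u %s\n", pid_nr(pid), fd, glock_name);
        }
    }
    rcu_read_unlock();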
2022-06-28 block: simplify blktrace sysfs attribute creation (Christoph Hellwig)
Add the trace attributes to the default gendisk attributes, just like we already do for partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220628171850.1313069-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 sched/fair: Remove the energy margin in feec() (Vincent Donnefort)
find_energy_efficient_cpu() integrates a margin to protect tasks from bouncing back and forth from one CPU to another. This margin is set as being 6% of the total current energy estimated on the system. This however does not work for two reasons:

1. The energy estimation is not a good absolute value:

   compute_energy() used in feec() is a good estimation for task placement as it allows to compare the energy with and without a task. The computed delta will give a good overview of the cost for a certain task placement. It, however, doesn't work as an absolute estimation for the total energy of the system. First it adds the contribution of idle CPUs into the energy, second it mixes util_avg with util_est values. util_avg contains the near history of a CPU's usage, it doesn't tell at all what the current utilization is. A system that has been quite busy in the near past will hold a very high energy and then a high margin, preventing any task migration to a lower capacity CPU and wasting energy. It even creates a negative feedback loop: by holding the tasks on a less efficient CPU, the margin contributes to keeping the energy high.

2. The margin handicaps small tasks:

   On a system where the workload is composed mostly of small tasks (which is often the case on Android), the overall energy will be high enough to create a margin none of those tasks can cross. On a Pixel4, a small utilization of 5% on all the CPUs creates a global estimated energy of 140 joules, as per the Energy Model declaration of that same device. This means, after applying the 6% margin, that any migration must save more than 8 joules to happen. No task with a utilization lower than 40 would then be able to migrate away from the biggest CPU of the system.

The 6% of the overall system energy was brought by the following patch: (eb92692b2544 sched/fair: Speed-up energy-aware wake-ups). It was previously 6% of the prev_cpu energy. Also, the following one made this margin value conditional on the clusters where the task fits: (8d4c97c105ca sched/fair: Only compute base_energy_pd if necessary).

We could simply revert that margin change to what it was, but the original version didn't have strong grounds either, and as demonstrated in (1.) the estimated energy isn't a good absolute value. Instead, remove it completely. This is made possible by recent changes that improved energy estimation comparison fairness (sched/fair: Remove task_util from effective utilization in feec()) (PM: EM: Increase energy calculation precision) and task utilization stabilization (sched/fair: Decay task util_avg during migration).

Without a margin, we could have feared bouncing between CPUs. But running LISA's eas_behaviour test coverage on three different platforms (Hikey960, RB-5 and DB-845) showed no issue. Removing the energy margin enables more energy-optimized placements for a more energy efficient system.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-8-vdonnefort@google.com
2022-06-28 sched/fair: Remove task_util from effective utilization in feec() (Vincent Donnefort)
The energy estimation in find_energy_efficient_cpu() (feec()) relies on the computation of the effective utilization for each CPU of a perf domain (PD). This effective utilization is then used as an estimation of the busy time for this pd. The function effective_cpu_util(), which gives this value, scales the utilization relative to IRQ pressure on the CPU to take into account that the IRQ time is hidden from the task clock. The IRQ scaling is as follows:

  effective_cpu_util = irq + (cpu_cap - irq) / cpu_cap * util

where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of the CPU and irq the IRQ avg time.

If now we take as an example a task placement which doesn't raise the OPP on the candidate CPU, we can write the energy delta as:

  delta = OPPcost / cpu_cap * (effective_cpu_util(cpu_util + task_util)
                               - effective_cpu_util(cpu_util))
        = OPPcost / cpu_cap * (cpu_cap - irq) / cpu_cap * task_util

We end up with an energy delta depending on the IRQ avg time, which is a problem: first, the time spent on IRQs by a CPU has no effect on the additional energy that would be consumed by a task. Second, we don't want to favour a CPU with a higher IRQ avg time value.

Nonetheless, we need to take the IRQ avg time into account. If a task placement raises the PD's frequency, it will increase the energy cost for the entire time where the CPU is busy. A solution is to only use effective_cpu_util() with the CPU contribution part. The task contribution is added separately and scaled according to prev_cpu's IRQ time. No change for the FREQUENCY_UTIL component of the energy estimation: we still want to get the actual frequency that would be selected after the task placement.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-7-vdonnefort@google.com
2022-06-28 sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu() (Dietmar Eggemann)
The Perf Domain (PD) cpumask (struct em_perf_domain.cpus) stays invariant after Energy Model creation, i.e. it is not updated after CPU hotplug operations. That's why the PD mask is used in conjunction with the cpu_online_mask (or Sched Domain cpumask). Thereby the cpu_online_mask is fetched multiple times (in compute_energy()) during a run-queue selection for a task. cpu_online_mask may change during this time, which can lead to wrong energy calculations.

To avoid this, use the select_rq_mask per-cpu cpumask to create a cpumask out of the PD cpumask and cpu_online_mask and pass it through the function calls of the EAS run-queue selection path.

The PD cpumask for max_spare_cap_cpu/compute_prev_delta selection (find_energy_efficient_cpu()) is now ANDed not only with the SD mask but also with the cpu_online_mask. This is fine since this cpumask has to be in sync with the one used for energy computation (compute_energy()). An exclusive cpuset setup with at least one asymmetric CPU capacity island (hence the additional AND with the SD cpumask) is the obvious exception here.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-6-vdonnefort@google.com
2022-06-28 sched/fair: Rename select_idle_mask to select_rq_mask (Dietmar Eggemann)
Decouple the name of the per-cpu cpumask select_idle_mask from its usage in select_idle_[cpu/capacity]() of the CFS run-queue selection (select_task_rq_fair()). This is to support the reuse of this cpumask in the Energy Aware Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue selection.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/250691c7-0e2b-05ab-bedf-b245c11d9400@arm.com
2022-06-28 sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() (Dietmar Eggemann)
effective_cpu_util() already has an `int cpu' parameter which allows to retrieve the CPU capacity scale factor (or maximum CPU capacity) inside this function via arch_scale_cpu_capacity(cpu). A lot of code calling effective_cpu_util() (or the shim sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call arch_scale_cpu_capacity() already. But not having to pass it into effective_cpu_util() will make the EAS wake-up code easier, especially when the maximum CPU capacity reduced by the thermal pressure is passed through the EAS wake-up functions.

Due to the asymmetric CPU capacity support of arm/arm64 architectures, arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via per_cpu(cpu_scale, cpu) on such a system. On all other architectures it is a compile-time constant (SCHED_CAPACITY_SCALE).

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com
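Sketched effect on callers (the parameter order is an assumption for illustration, not the exact prototype):

    /* Before: every caller fetched the max capacity itself */
    max  = arch_scale_cpu_capacity(cpu);
    util = effective_cpu_util(cpu, util_cfs, max, FREQUENCY_UTIL, NULL);

    /* After: the 'max' parameter is gone; effective_cpu_util() calls
     * arch_scale_cpu_capacity(cpu) internally when it needs it. */
    util = effective_cpu_util(cpu, util_cfs, FREQUENCY_UTIL, NULL);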