summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2022-09-09ACPI: s2idle: Add a new ->check() callback for platform_s2idle_opsMario Limonciello
On some platforms it is found that Linux more aggressively enters s2idle than Windows enters Modern Standby and this uncovers some synchronization issues for the platform. To aid in debugging this class of problems in the future, add support for an extra optional callback intended for drivers to emit extra debugging. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220829162953.5947-2-mario.limonciello@amd.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2022-09-09sched/psi: Per-cgroup PSI accounting disable/re-enable interfaceChengming Zhou
PSI accounts stalls for each cgroup separately and aggregates it at each level of the hierarchy. This may cause non-negligible overhead for some workloads when under deep level of the hierarchy. commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") make PSI to skip per-cgroup stall accounting, only account system-wide to avoid this each level overhead. But for our use case, we also want leaf cgroup PSI stats accounted for userspace adjustment on that cgroup, apart from only system-wide adjustment. So this patch introduce a per-cgroup PSI accounting disable/re-enable interface "cgroup.pressure", which is a read-write single value file that allowed values are "0" and "1", the defaults is "1" so per-cgroup PSI stats is enabled by default. Implementation details: It should be relatively straight-forward to disable and re-enable state aggregation, time tracking, averaging on a per-cgroup level, if we can live with losing history from while it was disabled. I.e. the avgs will restart from 0, total= will have gaps. But it's hard or complex to stop/restart groupc->tasks[] updates, which is not implemented in this patch. So we always update groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when the cgroup PSI stats is disabled. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
2022-09-09sched/psi: Cache parent psi_group to speed up group iterationChengming Zhou
We use iterate_groups() to iterate each level psi_group to update PSI stats, which is a very hot path. In current code, iterate_groups() have to use multiple branches and cgroup_parent() to get parent psi_group for each level, which is not very efficient. This patch cache parent psi_group in struct psi_group, only need to get psi_group of task itself first, then just use group->parent to iterate. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-10-zhouchengming@bytedance.com
2022-09-09sched/psi: Consolidate cgroup_psi()Chengming Zhou
cgroup_psi() can't return psi_group for root cgroup, so we have many open code "psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi". This patch move cgroup_psi() definition to <linux/psi.h>, in which we can return psi_system for root cgroup, so can handle all cgroups. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-9-zhouchengming@bytedance.com
2022-09-09sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressureChengming Zhou
Now PSI already tracked workload pressure stall information for CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have obvious impact on some workload productivity, such as web service workload. When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time from update_rq_clock_task(), in which we can record that delta to CPU curr task's cgroups as PSI_IRQ_FULL status. Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in the current task on the CPU, make nothing productive could run even if it were runnable, so we only use PSI_IRQ_FULL. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com
2022-09-09sched/psi: Remove NR_ONCPU task accountingJohannes Weiner
We put all fields updated by the scheduler in the first cacheline of struct psi_group_cpu for performance. Since we want add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure, we need to reclaim space first. This patch remove NR_ONCPU task accounting in struct psi_group_cpu, use one bit in state_mask to track instead. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Tested-by: Chengming Zhou <zhouchengming@bytedance.com> Link: https://lore.kernel.org/r/20220825164111.29534-7-zhouchengming@bytedance.com
2022-09-09sched/psi: Optimize task switch inside shared cgroups againChengming Zhou
Way back when PSI_MEM_FULL was accounted from the timer tick, task switching could simply iterate next and prev to the common ancestor to update TSK_ONCPU and be done. Then memstall ticks were replaced with checking curr->in_memstall directly in psi_group_change(). That meant that now if the task switch was between a memstall and a !memstall task, we had to iterate through the common ancestors at least ONCE to fix up their state_masks. We added the identical_state filter to make sure the common ancestor elimination was skipped in that case. It seems that was always a little too eager, because it caused us to walk the common ancestors *twice* instead of the required once: the iteration for next could have stopped at the common ancestor; prev could have updated TSK_ONCPU up to the common ancestor, then finish to the root without changing any flags, just to get the new curr->in_memstall into the state_masks. This patch recognizes this and makes it so that we walk to the root exactly once if state_mask needs updating, which is simply catching up on a missed optimization that could have been done in commit 7fae6c8171d2 ("psi: Use ONCPU state tracking machinery to detect reclaim") directly. Apart from this, it's also necessary for the next patch "sched/psi: remove NR_ONCPU task accounting". Suppose we walk the common ancestors twice: (1) psi_group_change(.clear = 0, .set = TSK_ONCPU) (2) psi_group_change(.clear = TSK_ONCPU, .set = 0) We previously used tasks[NR_ONCPU] to record TSK_ONCPU, tasks[NR_ONCPU]++ in (1) then tasks[NR_ONCPU]-- in (2), so tasks[NR_ONCPU] still be correct. The next patch change to use one bit in state mask to record TSK_ONCPU, PSI_ONCPU bit will be set in (1), but then be cleared in (2), which cause the psi_group_cpu has task running on CPU but without PSI_ONCPU bit set! With this patch, we will never walk the common ancestors twice, so won't have above problem. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-6-zhouchengming@bytedance.com
2022-09-09sched/psi: Move private helpers to sched/stats.hChengming Zhou
This patch move psi_task_change/psi_task_switch declarations out of PSI public header, since they are only needed for implementing the PSI stats tracking in sched/stats.h psi_task_switch is obvious, psi_task_change can't be public helper since it doesn't check psi_disabled static key. And there is no any user now, so put it in sched/stats.h too. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-5-zhouchengming@bytedance.com
2022-09-09sched/psi: Save percpu memory when !psi_cgroups_enabledChengming Zhou
We won't use cgroup psi_group when !psi_cgroups_enabled, so don't bother to alloc percpu memory and init for it. Also don't need to migrate task PSI stats between cgroups in cgroup_move_task(). Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-4-zhouchengming@bytedance.com
2022-09-09sched/psi: Don't create cgroup PSI files when psi_disabledChengming Zhou
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") make PSI can be configured to skip per-cgroup stall accounting. And doesn't expose PSI files in cgroup hierarchy. This patch do the same thing when psi_disabled. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-3-zhouchengming@bytedance.com
2022-09-09sched/psi: Fix periodic aggregation shut offChengming Zhou
We don't want to wake periodic aggregation work back up if the task change is the aggregation worker itself going to sleep, or we'll ping-pong forever. Previously, we would use psi_task_change() in psi_dequeue() when task going to sleep, so this check was put in psi_task_change(). But commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") defer task sleep handling to psi_task_switch(), won't go through psi_task_change() anymore. So this patch move this check to psi_task_switch(). Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-2-zhouchengming@bytedance.com
2022-09-09Merge branch 'driver-core/driver-core-next'Peter Zijlstra
Pull in dependent cgroup patches Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2022-09-08module/decompress: generate sysfs string at compile timeDavid Disseldorp
compression_show() before (with noinline): 0xffffffff810b5ff0 <+0>: mov %rdx,%rdi 0xffffffff810b5ff3 <+3>: mov $0xffffffff81b55629,%rsi 0xffffffff810b5ffa <+10>: mov $0xffffffff81b0cde2,%rdx 0xffffffff810b6001 <+17>: call 0xffffffff811b8fd0 <sysfs_emit> 0xffffffff810b6006 <+22>: cltq 0xffffffff810b6008 <+24>: ret After: 0xffffffff810b5ff0 <+0>: mov $0xffffffff81b0cde2,%rsi 0xffffffff810b5ff7 <+7>: mov %rdx,%rdi 0xffffffff810b5ffa <+10>: call 0xffffffff811b8fd0 <sysfs_emit> 0xffffffff810b5fff <+15>: cltq 0xffffffff810b6001 <+17>: ret Signed-off-by: David Disseldorp <ddiss@suse.de> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-09-08kernel/sysctl-test: use SYSCTL_{ZERO/ONE_HUNDRED} instead of ↵Liu Shixin
i_{zero/one_hundred} It is better to use SYSCTL_ZERO and SYSCTL_ONE_HUNDRED instead of &i_zero and &i_one_hundred, and then we can remove these two local variable. No functional change. Signed-off-by: Liu Shixin <liushixin2@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-09-08kernel/sysctl.c: move sysctl_vals and sysctl_long_vals to sysctl.cLiu Shixin
sysctl_vals and sysctl_long_vals are declared even if sysctl is disabled. Move its definition to sysctl.c to make sure their integrity in any case. Signed-off-by: Liu Shixin <liushixin2@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-09-08sysctl: remove max_extfrag_thresholdLiu Shixin
Remove max_extfrag_threshold and replace by SYSCTL_ONE_THOUSAND. No functional change. Signed-off-by: Liu Shixin <liushixin2@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-09-08module: Add debugfs interface to view unloaded tainted modulesAaron Tomlin
This patch provides debug/modules/unloaded_tainted file to see a record of unloaded tainted modules. Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-09-08kernel/sysctl.c: remove unnecessary (void*) conversionsDong Chuanjian
remove unnecessary void* type casting Signed-off-by: Dong Chuanjian <chuanjian@nfschina.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-09-08kprobes: Prohibit probes in gate areaChristian A. Ehrhardt
The system call gate area counts as kernel text but trying to install a kprobe in this area fails with an Oops later on. To fix this explicitly disallow the gate area for kprobes. Found by syzkaller with the following reproducer: perf_event_open$cgroup(&(0x7f00000001c0)={0x6, 0x80, 0x0, 0x0, 0x0, 0x0, 0x80ffff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @perf_config_ext={0x0, 0xffffffffff600000}}, 0xffffffffffffffff, 0x0, 0xffffffffffffffff, 0x0) Sample report: BUG: unable to handle page fault for address: fffffbfff3ac6000 PGD 6dfcb067 P4D 6dfcb067 PUD 6df8f067 PMD 6de4d067 PTE 0 Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 0 PID: 21978 Comm: syz-executor.2 Not tainted 6.0.0-rc3-00363-g7726d4c3e60b-dirty #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:__insn_get_emulate_prefix arch/x86/lib/insn.c:91 [inline] RIP: 0010:insn_get_emulate_prefix arch/x86/lib/insn.c:106 [inline] RIP: 0010:insn_get_prefixes.part.0+0xa8/0x1110 arch/x86/lib/insn.c:134 Code: 49 be 00 00 00 00 00 fc ff df 48 8b 40 60 48 89 44 24 08 e9 81 00 00 00 e8 e5 4b 39 ff 4c 89 fa 4c 89 f9 48 c1 ea 03 83 e1 07 <42> 0f b6 14 32 38 ca 7f 08 84 d2 0f 85 06 10 00 00 48 89 d8 48 89 RSP: 0018:ffffc900088bf860 EFLAGS: 00010246 RAX: 0000000000040000 RBX: ffffffff9b9bebc0 RCX: 0000000000000000 RDX: 1ffffffff3ac6000 RSI: ffffc90002d82000 RDI: ffffc900088bf9e8 RBP: ffffffff9d630001 R08: 0000000000000000 R09: ffffc900088bf9e8 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001 R13: ffffffff9d630000 R14: dffffc0000000000 R15: ffffffff9d630000 FS: 00007f63eef63640(0000) GS:ffff88806d000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffbfff3ac6000 CR3: 0000000029d90005 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: <TASK> insn_get_prefixes arch/x86/lib/insn.c:131 [inline] insn_get_opcode arch/x86/lib/insn.c:272 [inline] insn_get_modrm+0x64a/0x7b0 arch/x86/lib/insn.c:343 insn_get_sib+0x29a/0x330 arch/x86/lib/insn.c:421 insn_get_displacement+0x350/0x6b0 arch/x86/lib/insn.c:464 insn_get_immediate arch/x86/lib/insn.c:632 [inline] insn_get_length arch/x86/lib/insn.c:707 [inline] insn_decode+0x43a/0x490 arch/x86/lib/insn.c:747 can_probe+0xfc/0x1d0 arch/x86/kernel/kprobes/core.c:282 arch_prepare_kprobe+0x79/0x1c0 arch/x86/kernel/kprobes/core.c:739 prepare_kprobe kernel/kprobes.c:1160 [inline] register_kprobe kernel/kprobes.c:1641 [inline] register_kprobe+0xb6e/0x1690 kernel/kprobes.c:1603 __register_trace_kprobe kernel/trace/trace_kprobe.c:509 [inline] __register_trace_kprobe+0x26a/0x2d0 kernel/trace/trace_kprobe.c:477 create_local_trace_kprobe+0x1f7/0x350 kernel/trace/trace_kprobe.c:1833 perf_kprobe_init+0x18c/0x280 kernel/trace/trace_event_perf.c:271 perf_kprobe_event_init+0xf8/0x1c0 kernel/events/core.c:9888 perf_try_init_event+0x12d/0x570 kernel/events/core.c:11261 perf_init_event kernel/events/core.c:11325 [inline] perf_event_alloc.part.0+0xf7f/0x36a0 kernel/events/core.c:11619 perf_event_alloc kernel/events/core.c:12059 [inline] __do_sys_perf_event_open+0x4a8/0x2a00 kernel/events/core.c:12157 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f63ef7efaed Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f63eef63028 EFLAGS: 00000246 ORIG_RAX: 000000000000012a RAX: ffffffffffffffda RBX: 00007f63ef90ff80 RCX: 00007f63ef7efaed RDX: 0000000000000000 RSI: ffffffffffffffff RDI: 00000000200001c0 RBP: 00007f63ef86019c R08: 0000000000000000 R09: 0000000000000000 R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000002 R14: 00007f63ef90ff80 R15: 00007f63eef43000 </TASK> Modules linked in: CR2: fffffbfff3ac6000 ---[ end trace 0000000000000000 ]--- RIP: 0010:__insn_get_emulate_prefix arch/x86/lib/insn.c:91 [inline] RIP: 0010:insn_get_emulate_prefix arch/x86/lib/insn.c:106 [inline] RIP: 0010:insn_get_prefixes.part.0+0xa8/0x1110 arch/x86/lib/insn.c:134 Code: 49 be 00 00 00 00 00 fc ff df 48 8b 40 60 48 89 44 24 08 e9 81 00 00 00 e8 e5 4b 39 ff 4c 89 fa 4c 89 f9 48 c1 ea 03 83 e1 07 <42> 0f b6 14 32 38 ca 7f 08 84 d2 0f 85 06 10 00 00 48 89 d8 48 89 RSP: 0018:ffffc900088bf860 EFLAGS: 00010246 RAX: 0000000000040000 RBX: ffffffff9b9bebc0 RCX: 0000000000000000 RDX: 1ffffffff3ac6000 RSI: ffffc90002d82000 RDI: ffffc900088bf9e8 RBP: ffffffff9d630001 R08: 0000000000000000 R09: ffffc900088bf9e8 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001 R13: ffffffff9d630000 R14: dffffc0000000000 R15: ffffffff9d630000 FS: 00007f63eef63640(0000) GS:ffff88806d000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffbfff3ac6000 CR3: 0000000029d90005 CR4: 0000000000770ef0 PKRU: 55555554 ================================================================== Link: https://lkml.kernel.org/r/20220907200917.654103-1-lk@c--e.de cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> cc: "David S. Miller" <davem@davemloft.net> Cc: stable@vger.kernel.org Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Christian A. Ehrhardt <lk@c--e.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-07bpf: Add helper macro bpf_for_each_reg_in_vstateKumar Kartikeya Dwivedi
For a lot of use cases in future patches, we will want to modify the state of registers part of some same 'group' (e.g. same ref_obj_id). It won't just be limited to releasing reference state, but setting a type flag dynamically based on certain actions, etc. Hence, we need a way to easily pass a callback to the function that iterates over all registers in current bpf_verifier_state in all frames upto (and including) the curframe. While in C++ we would be able to easily use a lambda to pass state and the callback together, sadly we aren't using C++ in the kernel. The next best thing to avoid defining a function for each case seems like statement expressions in GNU C. The kernel already uses them heavily, hence they can passed to the macro in the style of a lambda. The statement expression will then be substituted in the for loop bodies. Variables __state and __reg are set to current bpf_func_state and reg for each invocation of the expression inside the passed in verifier state. Then, convert mark_ptr_or_null_regs, clear_all_pkt_pointers, release_reference, find_good_pkt_pointers, find_equal_scalars to use bpf_for_each_reg_in_vstate. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220904204145.3089-16-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-07perf: Add a few assertionsPeter Zijlstra
While auditing 6b959ba22d34 ("perf/core: Fix reentry problem in perf_output_read_group()") a few spots were found that wanted assertions. Notable for_each_sibling_event() relies on exclusion from modification. This would normally be holding either ctx->lock or ctx->mutex, however due to how things are constructed disabling IRQs is a valid and sufficient substitute for ctx->lock. Another possible site to add assertions would be the various pmu::{add,del,read,..}() methods, but that's not trivially expressable in C -- the best option is wrappers, but those are easy enough to forget. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2022-09-07perf: Consolidate branch sample filter helpersAnshuman Khandual
Besides the branch type filtering requests, 'event.attr.branch_sample_type' also contains various flags indicating which additional information should be captured, along with the base branch record. These flags help configure the underlying hardware, and capture the branch records appropriately when required e.g after PMU interrupt. But first, this moves an existing helper perf_sample_save_hw_index() into the header before adding some more helpers for other branch sample filter flags. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220906084414.396220-1-anshuman.khandual@arm.com
2022-09-07freezer,sched: Rewrite core freezer logicPeter Zijlstra
Rewrite the core freezer to behave better wrt thawing and be simpler in general. By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible. As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay!). Specifically; the current scheme works a little like: freezer_do_not_count(); schedule(); freezer_count(); And either the task is blocked, or it lands in try_to_freezer() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues. However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time. That is, thawing tries to thaw things in explicit order; kernel threads and workqueues before doing bringing SMP back before userspace etc.. However due to the above mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back. This can be a fatal problem in asymmetric ISA architectures (eg ARMv9) where the userspace task requires a special CPU to run. As said; replace this with a special task state TASK_FROZEN and add the following state transitions: TASK_FREEZABLE -> TASK_FROZEN __TASK_STOPPED -> TASK_FROZEN __TASK_TRACED -> TASK_FROZEN The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state is already required to deal with spurious wakeups and the freezer causes one such when thawing the task (since the original state is lost). The special __TASK_{STOPPED,TRACED} states *can* be restored since their canonical state is in ->jobctl. With this, frozen tasks need an explicit TASK_FROZEN wakeup and are free of undue (early / spurious) wakeups. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
2022-09-07sched/completion: Add wait_for_completion_state()Peter Zijlstra
Allows waiting with a custom @state. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220822114648.922711674@infradead.org
2022-09-07sched: Add TASK_ANY for wait_task_inactive()Peter Zijlstra
Now that wait_task_inactive()'s @match_state argument is a mask (like ttwu()) it is possible to replace the special !match_state case with an 'all-states' value such that any blocked state will match. Suggested-by: Ingo Molnar (mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YxhkzfuFTvRnpUaH@hirez.programming.kicks-ass.net
2022-09-07sched: Change wait_task_inactive()s match_statePeter Zijlstra
Make wait_task_inactive()'s @match_state work like ttwu()'s @state. That is, instead of an equal comparison, use it as a mask. This allows matching multiple block conditions. (removes the unlikely; it doesn't make sense how it's only part of the condition) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220822114648.856734578@infradead.org
2022-09-07freezer,umh: Clean up freezer/initrd interactionPeter Zijlstra
handle_initrd() marks itself as PF_FREEZER_SKIP in order to ensure that the UMH, which is going to freeze the system, doesn't indefinitely wait for it's caller. Rework things by adding UMH_FREEZABLE to indicate the completion is freezable. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114648.791019324@infradead.org
2022-09-07freezer: Have {,un}lock_system_sleep() save/restore flagsPeter Zijlstra
Rafael explained that the reason for having both PF_NOFREEZE and PF_FREEZER_SKIP is that {,un}lock_system_sleep() is callable from kthread context that has previously called set_freezable(). In preparation of merging the flags, have {,un}lock_system_slee() save and restore current->flags. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114648.725003428@infradead.org
2022-09-07sched: Rename task_running() to task_on_cpu()Peter Zijlstra
There is some ambiguity about task_running() in that it is unrelated to TASK_RUNNING but instead tests ->on_cpu. As such, rename the thing task_on_cpu(). Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/Yxhkhn55uHZx+NGl@hirez.programming.kicks-ass.net
2022-09-07sched/fair: Cleanup for SIS_PROPAbel Wu
The sched-domain of this cpu is only used for some heuristics when SIS_PROP is enabled, and it should be irrelevant whether the local sd_llc is valid or not, since all we care about is target sd_llc if !SIS_PROP. Access the local domain only when there is a need. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20220907112000.1854-6-wuyun.abel@bytedance.com
2022-09-07sched/fair: Default to false in test_idle_cores()Abel Wu
It's uncertain whether idle cores exist or not if shared sched- domains are not ready, so returning "no idle cores" usually makes sense. While __update_idle_core() is an exception, it checks status of this core and set hint to shared sched-domain if necessary. So the whole logic of this function depends on the existence of shared sched-domain, and can certainly bail out early if it is not available. It's somehow a little tricky, and as Josh suggested that it should be transient while the domain isn't ready. So remove the self-defined default value to make things more clearer. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-5-wuyun.abel@bytedance.com
2022-09-07sched/fair: Remove useless check in select_idle_core()Abel Wu
The function select_idle_core() only gets called when has_idle_cores is true which can be possible only when sched_smt_present is enabled. This change also aligns select_idle_core() with select_idle_smt() in the way that the caller do the check if necessary. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-4-wuyun.abel@bytedance.com
2022-09-07sched/fair: Avoid double search on same cpuAbel Wu
The prev cpu is checked at the beginning of SIS, and it's unlikely to be idle before the second check in select_idle_smt(). So we'd better focus on its SMT siblings. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-3-wuyun.abel@bytedance.com
2022-09-07sched/fair: Remove redundant check in select_idle_smt()Abel Wu
If two cpus share LLC cache, then the two cores they belong to are also in the same LLC domain. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-2-wuyun.abel@bytedance.com
2022-09-07bpf: Support kptrs in percpu arraymapKumar Kartikeya Dwivedi
Enable support for kptrs in percpu BPF arraymap by wiring up the freeing of these kptrs from percpu map elements. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220904204145.3089-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-07bpf: Fix resetting logic for unreferenced kptrsJules Irenge
Sparse reported a warning at bpf_map_free_kptrs() "warning: Using plain integer as NULL pointer" During the process of fixing this warning, it was discovered that the current code erroneously writes to the pointer variable instead of deferencing and writing to the actual kptr. Hence, Sparse tool accidentally helped to uncover this problem. Fix this by doing WRITE_ONCE(*p, 0) instead of WRITE_ONCE(p, 0). Note that the effect of this bug is that unreferenced kptrs will not be cleared during check_and_free_fields. It is not a problem if the clearing is not done during map_free stage, as there is nothing to free for them. Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr") Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Link: https://lore.kernel.org/r/Yxi3pJaK6UDjVJSy@playground Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-07bpf/verifier: allow kfunc to return an allocated memBenjamin Tissoires
For drivers (outside of network), the incoming data is not statically defined in a struct. Most of the time the data buffer is kzalloc-ed and thus we can not rely on eBPF and BTF to explore the data. This commit allows to return an arbitrary memory, previously allocated by the driver. An interesting extra point is that the kfunc can mark the exported memory region as read only or read/write. So, when a kfunc is not returning a pointer to a struct but to a plain type, we can consider it is a valid allocated memory assuming that: - one of the arguments is either called rdonly_buf_size or rdwr_buf_size - and this argument is a const from the caller point of view We can then use this parameter as the size of the allocated memory. The memory is either read-only or read-write based on the name of the size parameter. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20220906151303.2780789-7-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-07bpf/btf: bump BTF_KFUNC_SET_MAX_CNTBenjamin Tissoires
net/bpf/test_run.c is already presenting 20 kfuncs. net/netfilter/nf_conntrack_bpf.c is also presenting an extra 10 kfuncs. Given that all the kfuncs are regrouped into one unique set, having only 2 space left prevent us to add more selftests. Bump it to 256. Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20220906151303.2780789-6-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-07bpf/verifier: allow all functions to read user provided contextBenjamin Tissoires
When a function was trying to access data from context in a syscall eBPF program, the verifier was rejecting the call unless it was accessing the first element. This is because the syscall context is not known at compile time, and so we need to check this when actually accessing it. Check for the valid memory access if there is no convert_ctx callback, and allow such situation to happen. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20220906151303.2780789-4-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-07bpf: split btf_check_subprog_arg_match in twoBenjamin Tissoires
btf_check_subprog_arg_match() was used twice in verifier.c: - when checking for the type mismatches between a (sub)prog declaration and BTF - when checking the call of a subprog to see if the provided arguments are correct and valid This is problematic when we check if the first argument of a program (pointer to ctx) is correctly accessed: To be able to ensure we access a valid memory in the ctx, the verifier assumes the pointer to context is not null. This has the side effect of marking the program accessing the entire context, even if the context is never dereferenced. For example, by checking the context access with the current code, the following eBPF program would fail with -EINVAL if the ctx is set to null from the userspace: ``` SEC("syscall") int prog(struct my_ctx *args) { return 0; } ``` In that particular case, we do not want to actually check that the memory is correct while checking for the BTF validity, but we just want to ensure that the (sub)prog definition matches the BTF we have. So split btf_check_subprog_arg_match() in two so we can actually check for the memory used when in a call, and ignore that part when not. Note that a further patch is in preparation to disentangled btf_check_func_arg_match() from these two purposes, and so right now we just add a new hack around that by adding a boolean to this function. Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220906151303.2780789-3-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-07cgroup/cpuset: remove unreachable codeJiapeng Chong
The function sched_partition_show cannot execute seq_puts, delete the invalid code. kernel/cgroup/cpuset.c:2849 sched_partition_show() warn: ignoring unreachable code. Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2087 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-07audit: remove selinux_audit_rule_update() declarationXiu Jianfeng
selinux_audit_rule_update() has been renamed to audit_update_lsm_rules() since commit d7a96f3a1ae2 ("Audit: internally use the new LSM audit hooks"), so remove it. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2022-09-07kernel/module: add __dyndbg_classes sectionJim Cromie
Add __dyndbg_classes section, using __dyndbg as a model. Use it: vmlinux.lds.h: KEEP the new section, which also silences orphan section warning on loadable modules. Add (__start_/__stop_)__dyndbg_classes linker symbols for the c externs (below). kernel/module/main.c: - fill new fields in find_module_sections(), using section_objs() - extend callchain prototypes to pass classes, length load_module(): pass new info to dynamic_debug_setup() dynamic_debug_setup(): new params, pass through to ddebug_add_module() dynamic_debug.c: - add externs to the linker symbols. ddebug_add_module(): - It currently builds a debug_table, and *will* find and attach classes. dynamic_debug_init(): - add class fields to the _ddebug_info cursor var: di. Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Link: https://lore.kernel.org/r/20220904214134.408619-16-jim.cromie@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-07dyndbg: gather __dyndbg[] state into struct _ddebug_infoJim Cromie
This new struct composes the linker provided (vector,len) section, and provides a place to add other __dyndbg[] state-data later: descs - the vector of descriptors in __dyndbg section. num_descs - length of the data/section. Use it, in several different ways, as follows: In lib/dynamic_debug.c: ddebug_add_module(): Alter params-list, replacing 2 args (array,index) with a struct _ddebug_info * containing them both, with room for expansion. This helps future-proof the function prototype against the looming addition of class-map info into the dyndbg-state, by providing a place to add more member fields later. NB: later add static struct _ddebug_info builtins_state declaration, not needed yet. ddebug_add_module() is called in 2 contexts: In dynamic_debug_init(), declare, init a struct _ddebug_info di auto-var to use as a cursor. Then iterate over the prdbg blocks of the builtin modules, and update the di cursor before calling _add_module for each. Its called from kernel/module/main.c:load_info() for each loaded module: In internal.h, alter struct load_info, replacing the dyndbg array,len fields with an embedded _ddebug_info containing them both; and populate its members in find_module_sections(). The 2 calling contexts differ in that _init deals with contiguous subranges of __dyndbgs[] section, packed together, while loadable modules are added one at a time. So rename ddebug_add_module() into outer/__inner fns, call __inner from _init, and provide the offset into the builtin __dyndbgs[] where the module's prdbgs reside. The cursor provides start, len of the subrange for each. The offset will be used later to pack the results of builtin __dyndbg_sites[] de-duplication, and is 0 and unneeded for loadable modules, Note: kernel/module/main.c includes <dynamic_debug.h> for struct _ddeubg_info. This might be prone to include loops, since its also included by printk.h. Nothing has broken in robot-land on this. cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Link: https://lore.kernel.org/r/20220904214134.408619-12-jim.cromie@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-07dma-mapping: mark dma_supported staticChristoph Hellwig
Now that the remaining users in drivers are gone, this function can be marked static. Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-07swiotlb: fix a typoChao Gao
"overwirte" isn't a word. It should be "overwrite". Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-07swiotlb: avoid potential left shift overflowChao Gao
The second operand passed to slot_addr() is declared as int or unsigned int in all call sites. The left-shift to get the offset of a slot can overflow if swiotlb size is larger than 4G. Convert the macro to an inline function and declare the second argument as phys_addr_t to avoid the potential overflow. Fixes: 26a7e094783d ("swiotlb: refactor swiotlb_tbl_map_single") Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-07dma-debug: improve search for partial syncsRobin Murphy
When bucket_find_contains() tries to find the original entry for a partial sync, it manages to constrain its search in a way that is both too restrictive and not restrictive enough. A driver which only uses single mappings rather than scatterlists might not set max_seg_size, but could still technically perform a partial sync at an offset of more than 64KB into a sufficiently large mapping, so we could stop searching too early before reaching a legitimate entry. Conversely, if no valid entry is present and max_range is large enough, we can pointlessly search buckets that we've already searched, or that represent an impossible wrapping around the bottom of the address space. At worst, the (legitimate) case of max_seg_size == UINT_MAX can make the loop infinite. Replace the fragile and frankly hard-to-follow "range" logic with a simple counted loop for the number of possible hash buckets below the given address. Reported-by: Yunfei Wang <yf.wang@mediatek.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-07Revert "swiotlb: panic if nslabs is too small"Yu Zhao
This reverts commit 0bf28fc40d89b1a3e00d1b79473bad4e9ca20ad1. Reasons: 1. new panic()s shouldn't be added [1]. 2. It does no "cleanup" but breaks MIPS [2]. v2: properly solved the conflict [3] with commit 20347fca71a38 ("swiotlb: split up the global swiotlb lock") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> [1] https://lore.kernel.org/r/CAHk-=wit-DmhMfQErY29JSPjFgebx_Ld+pnerc4J2Ag990WwAA@mail.gmail.com/ [2] https://lore.kernel.org/r/20220820012031.1285979-1-yuzhao@google.com/ [3] https://lore.kernel.org/r/202208310701.LKr1WDCh-lkp@intel.com/ Fixes: 0bf28fc40d89b ("swiotlb: panic if nslabs is too small") Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-06bpf: Allow struct argument in trampoline based programsYonghong Song
Allow struct argument in trampoline based programs where the struct size should be <= 16 bytes. In such cases, the argument will be put into up to 2 registers for bpf, x86_64 and arm64 architectures. To support arch-specific trampoline manipulation, add arg_flags for additional struct information about arguments in btf_func_model. Such information will be used in arch specific function arch_prepare_bpf_trampoline() to prepare argument access properly in trampoline. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220831152646.2078089-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>