path: root/kernel
Age  Commit message  Author
2021-11-30  refscale: Prevent buffer to pr_alert() being too long  (Li Zhijian)
0Day/LKP observed that the refscale results fail to complete when larger values of nrun (such as 300) are specified. The problem is that printk() can accept at most a 1024-byte buffer. This commit therefore prints the buffer whenever its length exceeds 800 bytes. CC: Philip Li <philip.li@intel.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
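[Editor's note] A minimal sketch of the flush-early pattern this commit describes; the buffer name, helper, and the exact handling of the 800-byte threshold are illustrative assumptions, not the actual refscale code.
```c
#include <linux/kernel.h>
#include <linux/string.h>

/* Illustrative only: flush accumulated output well before printk()'s
 * roughly 1024-byte limit so no single pr_alert() call is truncated. */
static void flush_if_nearly_full(char *buf)
{
        if (strlen(buf) >= 800) {       /* threshold taken from the commit text */
                pr_alert("%s", buf);
                buf[0] = '\0';          /* restart with an empty buffer */
        }
}
```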
2021-11-30  refscale: Simplify the errexit checkpoint  (Li Zhijian)
There is only the one OOM error case in main_func(), so this commit eliminates the errexit local variable in favor of a branch to cleanup code. Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
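[Editor's note] A hedged before/after sketch of the simplification described above; the variable, condition, and label names are illustrative rather than the actual main_func() code.
```c
/* Before (illustrative): remember the OOM failure in a flag and test it later. */
bool errexit = false;
if (!result_buf) {
        pr_err("out of memory\n");
        errexit = true;
}
/* ... more setup ... */
if (errexit)
        goto cleanup;

/* After (illustrative): branch straight to the cleanup code at the one OOM site. */
if (!result_buf) {
        pr_err("out of memory\n");
        goto cleanup;
}
```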
2021-11-30  rcutorture: Suppress pi-lock-across read-unlock testing for Tiny SRCU  (Paul E. McKenney)
Because Tiny srcu_read_unlock() directly calls swake_up_one(), lockdep complains when a pi lock is held across that srcu_read_unlock(). Although this is a lockdep false positive (there is no other CPU to complete the deadlock cycle), lockdep is what it is at the moment. This commit therefore prevents rcutorture from holding pi lock across a Tiny srcu_read_unlock(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30  rcutorture: More thoroughly test nested readers  (Paul E. McKenney)
Currently, nested readers occur only when a timer handler interrupts a reader. This is rare, and is thus insufficient testing of the transition between nesting levels. This commit therefore causes rcutorture nested readers to be the rule rather than the exception. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30  rcutorture: Sanitize RCUTORTURE_RDR_MASK  (Paul E. McKenney)
RCUTORTURE_RDR_MASK is currently not the bit indicated by RCUTORTURE_RDR_SHIFT, but is instead all the bits less significant than that one. This is an accident waiting to happen, so this commit makes RCUTORTURE_RDR_MASK be that one bit and adjusts uses accordingly. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
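[Editor's note] The difference is easier to see in code; the following is an illustrative sketch of the two definitions, not the exact rcutorture macros.
```c
#define RDR_SHIFT       8

/* Before: every bit below the shift position (0x00ff for a shift of 8). */
#define RDR_MASK_OLD    ((1 << RDR_SHIFT) - 1)

/* After: exactly the one bit indicated by the shift (0x0100). */
#define RDR_MASK_NEW    (1 << RDR_SHIFT)
```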
2021-11-30  rcu-tasks: Don't remove tasks with pending IPIs from holdout list  (Paul E. McKenney)
Currently, the check_all_holdout_tasks_trace() function removes all tasks marked with ->trc_reader_checked from the holdout list, including those with IPIs pending. This means that the IPI handler might arrive at a task that has already been removed from the list, which is at best an accident waiting to happen. This commit therefore avoids removing tasks with IPIs pending from the holdout list. This in turn means that the "if" condition in the for_each_online_cpu() loop in rcu_tasks_trace_postgp() should always evaluate to false, so a WARN_ON_ONCE() is added to check that. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30  srcu: Prevent redundant __srcu_read_unlock() wakeup  (Paul E. McKenney)
Tiny SRCU readers can appear at task level, but also in interrupt and softirq handlers. Because Tiny SRCU is selected only in kernels built with CONFIG_SMP=n and CONFIG_PREEMPTION=n, it is not possible for a grace period to start while there is a non-task-level SRCU reader executing. This means that it does not make sense for __srcu_read_unlock() to awaken the Tiny SRCU grace period, because that can only happen when the grace period is waiting for one value of ->srcu_idx and __srcu_read_unlock() is ending the last reader for some other value of ->srcu_idx. After all, any such wakeup will be redundant. Worse yet, in some cases, such wakeups generate lockdep splats: ====================================================== WARNING: possible circular locking dependency detected 5.15.0-rc1+ #3758 Not tainted ------------------------------------------------------ rcu_torture_rea/53 is trying to acquire lock: ffffffff9514e6a8 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}, at: xa/0x30 but task is already holding lock: ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at: _extend+0x370/0x400 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&p->pi_lock){-.-.}-{2:2}: _raw_spin_lock_irqsave+0x2f/0x50 try_to_wake_up+0x50/0x580 swake_up_locked.part.7+0xe/0x30 swake_up_one+0x22/0x30 rcutorture_one_extend+0x1b6/0x400 rcu_torture_one_read+0x290/0x5d0 rcu_torture_timer+0x1a/0x70 call_timer_fn+0xa6/0x230 run_timer_softirq+0x493/0x4c0 __do_softirq+0xc0/0x371 irq_exit+0x73/0x90 sysvec_apic_timer_interrupt+0x63/0x80 asm_sysvec_apic_timer_interrupt+0x12/0x20 default_idle+0xb/0x10 default_idle_call+0x5e/0x170 do_idle+0x18a/0x1f0 cpu_startup_entry+0xa/0x10 start_kernel+0x678/0x69f secondary_startup_64_no_verify+0xc2/0xcb -> #0 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}: __lock_acquire+0x130c/0x2440 lock_acquire+0xc2/0x270 _raw_spin_lock_irqsave+0x2f/0x50 swake_up_one+0xa/0x30 rcutorture_one_extend+0x387/0x400 rcu_torture_one_read+0x290/0x5d0 rcu_torture_reader+0xac/0x200 kthread+0x12d/0x150 ret_from_fork+0x22/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&p->pi_lock); lock(srcu_ctl.srcu_wq.lock); lock(&p->pi_lock); lock(srcu_ctl.srcu_wq.lock); *** DEADLOCK *** 1 lock held by rcu_torture_rea/53: #0: ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at: _extend+0x370/0x400 stack backtrace: CPU: 0 PID: 53 Comm: rcu_torture_rea Not tainted 5.15.0-rc1+ Hardware name: Red Hat KVM/RHEL-AV, BIOS e_el8.5.0+746+bbd5d70c 04/01/2014 Call Trace: check_noncircular+0xfe/0x110 ? find_held_lock+0x2d/0x90 __lock_acquire+0x130c/0x2440 lock_acquire+0xc2/0x270 ? swake_up_one+0xa/0x30 ? find_held_lock+0x72/0x90 _raw_spin_lock_irqsave+0x2f/0x50 ? swake_up_one+0xa/0x30 swake_up_one+0xa/0x30 rcutorture_one_extend+0x387/0x400 rcu_torture_one_read+0x290/0x5d0 rcu_torture_reader+0xac/0x200 ? rcutorture_oom_notify+0xf0/0xf0 ? __kthread_parkme+0x61/0x90 ? rcu_torture_one_read+0x5d0/0x5d0 kthread+0x12d/0x150 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 This is a false positive because there is only one CPU, and both locks are raw (non-preemptible) spinlocks. However, it is worthwhile getting rid of the redundant wakeup, which has the side effect of breaking the theoretical deadlock cycle. This commit therefore eliminates the redundant wakeups. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30  rcu: Avoid alloc_pages() when recording stack  (Jun Miao)
The default kasan_record_aux_stack() calls stack_depot_save() with GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...). In general, however, it is not even possible to use either GFP_ATOMIC nor GFP_NOWAIT in certain non-preemptive contexts/RT kernel including raw_spin_locks (see gfp.h and ab00db216c9c7). Fix it by instructing stackdepot to not expand stack storage via alloc_pages() in case it runs out by using kasan_record_aux_stack_noalloc(). Jianwei Hu reported: BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:969 in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 15319, name: python3 INFO: lockdep is turned off. irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590 softirqs last enabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590 softirqs last disabled at (0): [<0000000000000000>] 0x0 CPU: 6 PID: 15319 Comm: python3 Tainted: G W O 5.15-rc7-preempt-rt #1 Hardware name: Supermicro SYS-E300-9A-8C/A2SDi-8C-HLN4F, BIOS 1.1b 12/17/2018 Call Trace: show_stack+0x52/0x58 dump_stack+0xa1/0xd6 ___might_sleep.cold+0x11c/0x12d rt_spin_lock+0x3f/0xc0 rmqueue+0x100/0x1460 rmqueue+0x100/0x1460 mark_usage+0x1a0/0x1a0 ftrace_graph_ret_addr+0x2a/0xb0 rmqueue_pcplist.constprop.0+0x6a0/0x6a0 __kasan_check_read+0x11/0x20 __zone_watermark_ok+0x114/0x270 get_page_from_freelist+0x148/0x630 is_module_text_address+0x32/0xa0 __alloc_pages_nodemask+0x2f6/0x790 __alloc_pages_slowpath.constprop.0+0x12d0/0x12d0 create_prof_cpu_mask+0x30/0x30 alloc_pages_current+0xb1/0x150 stack_depot_save+0x39f/0x490 kasan_save_stack+0x42/0x50 kasan_save_stack+0x23/0x50 kasan_record_aux_stack+0xa9/0xc0 __call_rcu+0xff/0x9c0 call_rcu+0xe/0x10 put_object+0x53/0x70 __delete_object+0x7b/0x90 kmemleak_free+0x46/0x70 slab_free_freelist_hook+0xb4/0x160 kfree+0xe5/0x420 kfree_const+0x17/0x30 kobject_cleanup+0xaa/0x230 kobject_put+0x76/0x90 netdev_queue_update_kobjects+0x17d/0x1f0 ... ... ksys_write+0xd9/0x180 __x64_sys_write+0x42/0x50 do_syscall_64+0x38/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Links: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/linux/kasan.h?id=7cb3007ce2da27ec02a1a3211941e7fe6875b642 Fixes: 84109ab58590 ("rcu: Record kvfree_call_rcu() call stack for KASAN") Fixes: 26e760c9a7c8 ("rcu: kasan: record and print call_rcu() call stack") Reported-by: Jianwei Hu <jianwei.hu@windriver.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Marco Elver <elver@google.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Jun Miao <jun.miao@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30  rcu: Avoid running boost kthreads on isolated CPUs  (Zqiang)
When the boost kthreads are created on systems with nohz_full CPUs, the cpus_allowed_ptr is set to housekeeping_cpumask(HK_FLAG_KTHREAD). However, when rcu_boost_kthread_setaffinity() is later called, the original affinity is changed and these kthreads can subsequently run on nohz_full CPUs. This commit makes rcu_boost_kthread_setaffinity() restrict these boost kthreads to housekeeping CPUs. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
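[Editor's note] A hedged sketch of the affinity restriction described above; which housekeeping flag the final patch actually passes is an assumption here (the commit text only mentions HK_FLAG_KTHREAD for kthread creation), and the helper name is illustrative.
```c
#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/sched/isolation.h>

/* Illustrative: mask the boost kthread's target affinity down to
 * housekeeping CPUs so it never lands on a nohz_full CPU. */
static void restrict_to_housekeeping(struct task_struct *t, struct cpumask *cm)
{
        cpumask_and(cm, cm, housekeeping_cpumask(HK_FLAG_KTHREAD));    /* assumed flag */
        if (cpumask_empty(cm))          /* fall back rather than run unbound */
                cpumask_copy(cm, housekeeping_cpumask(HK_FLAG_KTHREAD));
        set_cpus_allowed_ptr(t, cm);
}
```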
2021-11-30  rcu: Improve tree_plugin.h comments and add code cleanups  (Zhouyi Zhou)
This commit cleans up some comments and code in kernel/rcu/tree_plugin.h. Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30  rcu: in_irq() cleanup  (Changbin Du)
This commit replaces the obsolete and ambiguous macro in_irq() with its shiny new in_hardirq() equivalent. Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30  rcu: Move rcu_needs_cpu() to tree.c  (Paul E. McKenney)
Now that RCU_FAST_NO_HZ is no more, there is but one implementation of the rcu_needs_cpu() function. This commit therefore moves this function from kernel/rcu/tree_plugin.c to kernel/rcu/tree.c. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30  rcu: Remove the RCU_FAST_NO_HZ Kconfig option  (Paul E. McKenney)
All of the uses of CONFIG_RCU_FAST_NO_HZ=y that I have seen involve systems with RCU callbacks offloaded. In this situation, all that this Kconfig option does is slow down idle entry/exit with an additional always-taken early exit. If this is the only use case, then this Kconfig option is nothing but an attractive nuisance that needs to go away. This commit therefore removes the RCU_FAST_NO_HZ Kconfig option. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-11-30  clocksource: Reduce the default clocksource_watchdog() retries to 2  (Waiman Long)
With the previous patch, there is an extra watchdog read in each retry. Now the total number of clocksource reads is increased to 4 per iteration. In order to avoid increasing the clock skew check overhead, the default maximum number of retries is reduced from 3 to 2 to maintain the same 12 clocksource reads in the worst case. Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
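[Editor's note] The worst-case arithmetic implied above, written out; the attempt counts are inferred from the commit text and are a sketch rather than a quote of the code.
```c
/* Before: 3 clocksource reads per attempt, up to 1 + 3 retries = 4 attempts
 *         -> 3 * 4 = 12 reads in the worst case.
 * After:  4 reads per attempt (the extra watchdog read), up to 1 + 2 retries
 *         = 3 attempts -> 4 * 3 = 12 reads, so worst-case overhead is unchanged.
 */
```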
2021-11-30  clocksource: Avoid accidental unstable marking of clocksources  (Waiman Long)
Since commit db3a34e17433 ("clocksource: Retry clock read if long delays detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"), it has been found that the tsc clocksource can sometimes fall back to hpet on both Intel and AMD systems, especially when they are running stressful benchmarking workloads. Of the 23 systems tested with a v5.14 kernel, 10 of them switched to the hpet clock source during the test run. The result of falling back to hpet is a drastic reduction of performance when running benchmarks. For example, the fio performance tests can drop up to 70% whereas the iperf3 performance can drop up to 80%. Four hpet fallbacks happened during bootup. They were:
[ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
[ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
[ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
[ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
Other fallbacks happened while the systems were running stressful benchmarks. For example:
[ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
[46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold") changed the skew margin from 100us to 50us. I think this is too small and can easily be exceeded when running some stressful workloads on a thermally stressed system. So it is switched back to 100us. Even a maximum skew margin of 100us may be too small for some systems when booting up, especially if those systems are under thermal stress. To eliminate the case where the large skew is due to the system being too busy and thereby slowing down the reading of both the watchdog and the clocksource, an extra consecutive read of the watchdog clock is done to check this. The consecutive watchdog read delay is compared against WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that the system is just too busy. A warning will be printed to the console and the clock skew check is skipped for this round. Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected") Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
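[Editor's note] A hedged sketch of the "system too busy" filter described above; the surrounding variable names, message text, and loop placement are illustrative rather than the actual clocksource watchdog code.
```c
/* Illustrative: inside the watchdog timer callback, after the usual
 * watchdog/clocksource read sequence, do one more consecutive watchdog
 * read.  If even two back-to-back reads of the same clock are far apart,
 * the system is too busy for a meaningful skew comparison. */
wd_again = watchdog->read(watchdog);
wd_delay = clocksource_delta(wd_again, wd_end, watchdog->mask);
if (wd_delay > WATCHDOG_MAX_SKEW / 2) {
        pr_warn("timekeeping watchdog: system too busy, skipping skew check\n");
        continue;       /* skip this round, try again next interval */
}
```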
2021-11-30  bpf: Change bpf_kallsyms_lookup_name size type to ARG_CONST_SIZE_OR_ZERO  (Kumar Kartikeya Dwivedi)
Andrii mentioned in [0] that switching to ARG_CONST_SIZE_OR_ZERO lets the user avoid having to prove that the string size at runtime is not zero, and helps with not having to suppress clang optimizations. [0]: https://lore.kernel.org/bpf/CAEf4BzZa_vhXB3c8atNcTS6=krQvC25H7K7c3WWZhM=27ro=Wg@mail.gmail.com Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20211122235733.634914-2-memxor@gmail.com
2021-12-01  genirq/generic_chip: Constify irq_generic_chip_ops  (Rikard Falkeborn)
The only usage of irq_generic_chip_ops is to pass its address to irq_domain_add_linear() which takes a pointer to const struct irq_domain_ops. Make it const to allow the compiler to put it in read-only memory. [ tglx: Fixed subject prefix ] Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20211130214043.1257585-1-rikard.falkeborn@gmail.com
2021-12-01  sched: Snapshot thread flags  (Mark Rutland)
Some thread flags can be set remotely, and so even when IRQs are disabled, the flags can change under our feet. Generally this is unlikely to cause a problem in practice, but it is somewhat unsound, and KCSAN will legitimately warn that there is a data race. To avoid such issues, a snapshot of the flags has to be taken prior to using them. Some places already use READ_ONCE() for that, others do not. Convert them all to the new flag accessor helpers. The READ_ONCE(ti->flags) .. cmpxchg(ti->flags) loop in set_nr_if_polling() is left as-is for clarity. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paul E. McKenney <paulmck@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20211129130653.2037928-4-mark.rutland@arm.com
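[Editor's note] A hedged sketch of the pattern this patch (and the entry counterpart below) converts callers to; the work items tested and the helper called at the end are illustrative, while read_thread_flags() is the new accessor the series refers to.
```c
/* Illustrative: take one snapshot of the flags instead of re-reading a
 * field that another CPU may be updating concurrently. */
unsigned long ti_work = read_thread_flags();

if (ti_work & _TIF_NEED_RESCHED)
        schedule();
if (ti_work & _TIF_SIGPENDING)
        handle_pending_signal();        /* hypothetical helper */
```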
2021-12-01  entry: Snapshot thread flags  (Mark Rutland)
Some thread flags can be set remotely, and so even when IRQs are disabled, the flags can change under our feet. Generally this is unlikely to cause a problem in practice, but it is somewhat unsound, and KCSAN will legitimately warn that there is a data race. To avoid such issues, a snapshot of the flags has to be taken prior to using them. Some places already use READ_ONCE() for that, others do not. Convert them all to the new flag accessor helpers. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20211129130653.2037928-3-mark.rutland@arm.com
2021-11-30  bpf: Add bpf_loop helper  (Joanne Koong)
This patch adds the kernel-side and API changes for a new helper function, bpf_loop:
long bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx, u64 flags);
where long (*callback_fn)(u32 index, void *ctx);
bpf_loop invokes the "callback_fn" **nr_loops** times or until the callback_fn returns 1. The callback_fn can only return 0 or 1, and this is enforced by the verifier. The callback_fn index is zero-indexed. A few things to please note:
~ The "u64 flags" parameter is currently unused but is included in case a future use case for it arises.
~ In the kernel-side implementation of bpf_loop (kernel/bpf/bpf_iter.c), bpf_callback_t is used as the callback function cast.
~ A program can have nested bpf_loop calls but the program must still adhere to the verifier constraint on its stack depth (the stack depth cannot exceed MAX_BPF_STACK).
~ Recursive callback_fns do not pass the verifier, due to the call stack for these being too deep.
~ The next patch will include the tests and benchmark.
Signed-off-by: Joanne Koong <joannekoong@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211130030622.4131246-2-joannekoong@fb.com
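[Editor's note] A hedged usage sketch of the new helper from the BPF program side; the section name, context struct, and iteration count are illustrative, and it assumes a libbpf build against headers that already declare bpf_loop().
```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct loop_ctx {
        __u64 sum;              /* accumulated across callback invocations */
};

static long add_index(__u32 index, void *ctx)
{
        struct loop_ctx *c = ctx;

        c->sum += index;
        return 0;               /* 0 = keep looping, 1 = stop early */
}

SEC("tracepoint/syscalls/sys_enter_getpid")
int sum_indices(void *ctx)
{
        struct loop_ctx c = { .sum = 0 };

        bpf_loop(100, add_index, &c, 0);        /* flags must currently be 0 */
        return 0;
}

char LICENSE[] SEC("license") = "GPL";
```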
2021-11-30  bpf, docs: Prune all references to "internal BPF"  (Christoph Hellwig)
The eBPF name has completely taken over from "internal BPF" in general usage for the actual eBPF representation, or BPF for any general in-kernel use. Prune all remaining references to "internal BPF". Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20211119163215.971383-4-hch@lst.de
2021-11-30  bpf: Remove a redundant comment on bpf_prog_free  (Christoph Hellwig)
The comment telling that the prog_free helper is freeing the program is not exactly useful, so just remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20211119163215.971383-3-hch@lst.de
2021-11-29  cgroup: get the wrong css for css_alloc() during cgroup_init_subsys()  (Wei Yang)
css_alloc() needs the parent css, while cgroup_css() gets the current cgroup's css. So we are getting the wrong css during cgroup_init_subsys(). Fortunately, cgrp_dfl_root.cgrp's css is not set yet, so the value we pass to css_alloc() is NULL anyway. Let's pass NULL directly during init, since we know there is no parent yet. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-11-29  block: remove the ->rq_disk field in struct request  (Christoph Hellwig)
Just use the disk attached to the request_queue instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20211126121802.2090656-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29  fork: move copy_io to block/blk-ioc.c  (Christoph Hellwig)
Move the copying of the I/O context to the block layer as that is where we can use the proper low-level interfaces. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211126115817.2087431-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-28  Merge tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler fix from Thomas Gleixner: "A single scheduler fix to ensure that there is no stale KASAN shadow state left on the idle task's stack when a CPU is brought up after it was brought down before" * tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/scs: Reset task stack state in bringup_cpu()
2021-11-28  Merge tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull perf fix from Thomas Gleixner: "A single fix for perf to prevent it from sending SIGTRAP to another task from a trace point event as it's not possible to deliver a synchronous signal to a different task from there" * tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Ignore sigtrap for tracepoints destined for other tasks
2021-11-28  Merge tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull locking fixes from Thomas Gleixner: "Two regression fixes for reader writer semaphores: - Plug a race in the lock handoff which is caused by inconsistency of the reader and writer path and can lead to corruption of the underlying counter. - down_read_trylock() is suboptimal when the lock is contended and multiple readers trylock concurrently. That's due to the initial value being read non-atomically, which results in at least two compare exchange loops. Making the initial readout atomic reduces this significantly. With 40 readers, by 11% in a benchmark which enforces contention on mmap_sem" * tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rwsem: Optimize down_read_trylock() under highly contended case locking/rwsem: Make handoff bit handling more consistent
2021-11-28  Merge tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds)
Pull another tracing fix from Steven Rostedt: "Fix the fix of pid filtering. The setting of the pid filtering flag tested the "trace only this pid" case twice, and ignored the "trace everything but this pid" case. The 5.15 kernel does things a little differently due to the new sparse pid mask introduced in 5.16, and as the bug was discovered running the 5.15 kernel, and the first fix was initially done for that kernel, that fix handled both cases (only pid and all but pid), but the forward port to 5.16 created this bug" * tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Test the 'Do not trace this pid' case in create event
2021-11-27  tracing: Test the 'Do not trace this pid' case in create event  (Steven Rostedt (VMware))
When creating a new event (via a module, kprobe, eprobe, etc), the descriptors that are created must add flags for pid filtering if an instance has pid filtering enabled, as the flags are used at the time the event is executed to know if pid filtering should be done or not. The "Only trace this pid" case was added, but a cut and paste error made that case checked twice, instead of checking the "Trace all but this pid" case. Link: https://lore.kernel.org/all/202111280401.qC0z99JB-lkp@intel.com/ Fixes: 6cb206508b62 ("tracing: Check pid filtering when creating events") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-27  Merge tag 'trace-v5.16-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds)
Pull tracing fixes from Steven Rostedt: "Two fixes to event pid filtering: - Make sure newly created events reflect the current state of pid filtering - Take pid filtering into account when recording trigger events. (Also clean up the if statement to be cleaner)" * tag 'trace-v5.16-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix pid filtering when triggers are attached tracing: Check pid filtering when creating events
2021-11-26  tracing: Fix pid filtering when triggers are attached  (Steven Rostedt (VMware))
If an event is filtered by pid and a trigger that requires processing of the event to happen is attached to the event, the discard portion does not take the pid filtering into account, and the event will then be recorded when it should not have been. Cc: stable@vger.kernel.org Fixes: 3fdaf80f4a836 ("tracing: Implement event pid filtering") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-26  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
drivers/net/ipa/ipa_main.c 8afc7e471ad3 ("net: ipa: separate disabling setup from modem stop") 76b5fbcd6b47 ("net: ipa: kill ipa_modem_init()") Duplicated include, drop one. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-26  Merge tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm  (Linus Torvalds)
Pull power management fixes from Rafael Wysocki: "These address three issues in the intel_pstate driver and fix two problems related to hibernation. Specifics: - Make intel_pstate work correctly on Ice Lake server systems with out-of-band performance control enabled (Adamos Ttofari). - Fix EPP handling in intel_pstate during CPU offline and online in the active mode (Rafael Wysocki). - Make intel_pstate support ITMT on asymmetric systems with overclocking enabled (Srinivas Pandruvada). - Fix hibernation image saving when using the user space interface based on the snapshot special device file (Evan Green). - Make the hibernation code release the snapshot block device using the same mode that was used when acquiring it (Thomas Zeitlhofer)" * tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: hibernate: Fix snapshot partial write lengths PM: hibernate: use correct mode for swsusp_close() cpufreq: intel_pstate: ITMT support for overclocked system cpufreq: intel_pstate: Fix active mode offline/online EPP handling cpufreq: intel_pstate: Add Ice Lake server to out-of-band IDs
2021-11-26  tracing: Check pid filtering when creating events  (Steven Rostedt (VMware))
When pid filtering is activated in an instance, all of the events trace files for that instance has the PID_FILTER flag set. This determines whether or not pid filtering needs to be done on the event, otherwise the event is executed as normal. If pid filtering is enabled when an event is created (via a dynamic event or modules), its flag is not updated to reflect the current state, and the events are not filtered properly. Cc: stable@vger.kernel.org Fixes: 3fdaf80f4a836 ("tracing: Implement event pid filtering") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-25  futex: Remove futex_cmpxchg detection  (Arnd Bergmann)
Now that all architectures have a working futex implementation in any configuration, remove the runtime detection code. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Vineet Gupta <vgupta@kernel.org> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Link: https://lore.kernel.org/r/20211026100432.1730393-2-arnd@kernel.org
2021-11-24  PM: hibernate: Fix snapshot partial write lengths  (Evan Green)
snapshot_write() is inappropriately limiting the amount of data that can be written in cases where a partial page has already been written. For example, one would expect to be able to write 1 byte, then 4095 bytes to the snapshot device, and have both of those complete fully (since now we're aligned to a page again). But what ends up happening is we write 1 byte, then 4094/4095 bytes complete successfully. The reason is that simple_write_to_buffer()'s second argument is the total size of the buffer, not the size of the buffer minus the offset. Since simple_write_to_buffer() accounts for the offset in its implementation, snapshot_write() can just pass the full page size directly down. Signed-off-by: Evan Green <evgreen@chromium.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
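[Editor's note] A hedged sketch of the fix's core idea; the variable names mirror the description above but are illustrative rather than a verbatim excerpt of snapshot_write().
```c
/* simple_write_to_buffer() already applies *ppos (here the offset into the
 * current page), so the "available" argument should be the full page size.
 *
 * Buggy (illustrative): passes the remaining space, so the offset is
 * effectively subtracted twice and the final byte of the page is lost. */
res = simple_write_to_buffer(page, PAGE_SIZE - pg_offp, &pg_offp, buf, count);

/* Fixed (illustrative): let the helper do the offset accounting itself. */
res = simple_write_to_buffer(page, PAGE_SIZE, &pg_offp, buf, count);
```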
2021-11-24  PM: hibernate: use correct mode for swsusp_close()  (Thomas Zeitlhofer)
Commit 39fbef4b0f77 ("PM: hibernate: Get block device exclusively in swsusp_check()") changed the opening mode of the block device to (FMODE_READ | FMODE_EXCL). In the corresponding calls to swsusp_close(), the mode is still just FMODE_READ which triggers the warning in blkdev_flush_mapping() on resume from hibernate. So, use the mode (FMODE_READ | FMODE_EXCL) also when closing the device. Fixes: 39fbef4b0f77 ("PM: hibernate: Get block device exclusively in swsusp_check()") Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
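[Editor's note] A hedged sketch of the mode pairing; the device and holder variables are illustrative, the point is only that blkdev_put() must be passed the same mode flags that were used to acquire the device.
```c
/* Open the resume device for reading, claiming it exclusively. */
bdev = blkdev_get_by_dev(swsusp_resume_device, FMODE_READ | FMODE_EXCL, &holder);

/* ... read or verify the hibernation image ... */

/* Close with the matching mode; passing plain FMODE_READ here is what
 * triggered the blkdev_flush_mapping() warning described above. */
blkdev_put(bdev, FMODE_READ | FMODE_EXCL);
```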
2021-11-24  sched/scs: Reset task stack state in bringup_cpu()  (Mark Rutland)
To hot unplug a CPU, the idle task on that CPU calls a few layers of C code before finally leaving the kernel. When KASAN is in use, poisoned shadow is left around for each of the active stack frames. When shadow call stacks (SCS) are in use, the task's saved SCS SP is left pointing at an arbitrary point within the task's shadow call stack. When a CPU is offlined and then onlined back into the kernel, this stale state can adversely affect execution. Stale KASAN shadow can alias new stackframes and result in bogus KASAN warnings. A stale SCS SP is effectively a memory leak, and prevents a portion of the shadow call stack being used. Across a number of hotplug cycles the idle task's entire shadow call stack can become unusable. We previously fixed the KASAN issue in commit: e1b77c92981a5222 ("sched/kasan: remove stale KASAN poison after hotplug") ... by removing any stale KASAN stack poison immediately prior to onlining a CPU. Subsequently in commit: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled") ... the refactoring left the KASAN and SCS cleanup in one-time idle thread initialization code rather than something invoked prior to each CPU being onlined, breaking both as above. We fixed SCS (but not KASAN) in commit: 63acd42c0d4942f7 ("sched/scs: Reset the shadow stack when idle_task_exit") ... but as this runs in the context of the idle task being offlined it's potentially fragile. To fix these consistently and more robustly, reset the SCS SP and KASAN shadow of a CPU's idle task immediately before we online that CPU in bringup_cpu(). This ensures the idle task always has a consistent state when it is running, and removes the need to do so when exiting an idle task. Whenever any thread is created, dup_task_struct() will give the task a stack which is free of KASAN shadow, and initialize the task's SCS SP, so there's no need to specially initialize either for the idle thread within init_idle(), as this was only necessary to handle hotplug cycles. I've tested this on arm64 with:
* gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK
* clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK
... offlining and onlining CPUs with:
| while true; do
|   for C in /sys/devices/system/cpu/cpu*/online; do
|     echo 0 > $C;
|     echo 1 > $C;
|   done
| done
Fixes: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Qian Cai <quic_qiancai@quicinc.com> Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/
2021-11-23  tracing/uprobe: Fix uprobe_perf_open probes iteration  (Jiri Olsa)
Add missing 'tu' variable initialization in the probes loop, otherwise the head 'tu' is used instead of added probes. Link: https://lkml.kernel.org/r/20211123142801.182530-1-jolsa@kernel.org Cc: stable@vger.kernel.org Fixes: 99c9a923e97a ("tracing/uprobe: Fix double perf_event linking on multiprobe uprobe") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-23  sched/cpuacct: Make user/system times in cpuacct.stat more precise  (Andrey Ryabinin)
cpuacct.stat shows user time based on raw, random-precision, tick-based counters. Use cputime_adjust() to scale these values against the total runtime accounted by the scheduler, like we already do for the user/system times in /proc/<pid>/stat. Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20211115164607.23784-4-arbn@yandex-team.com
2021-11-23  sched/cpuacct: Fix user/system in shown cpuacct.usage*  (Andrey Ryabinin)
cpuacct has 2 different ways of accounting and showing user and system times. The first one uses cpuacct_account_field() to account times and the cpuacct.stat file to expose them, and this one seems to work correctly. The second one uses the cpuacct_charge() function for accounting and the set of cpuacct.usage* files to show times. Despite some attempts to fix it in the past, it still doesn't work. Sometimes while running a KVM guest, cpuacct_charge() accounts most of the guest time as system time. This doesn't match the user & system times shown in cpuacct.stat or /proc/<pid>/stat. Demonstration:
# git clone https://github.com/aryabinin/kvmsample
# make
# mkdir /sys/fs/cgroup/cpuacct/test
# echo $$ > /sys/fs/cgroup/cpuacct/test/tasks
# ./kvmsample &
# for i in {1..5}; do cat /sys/fs/cgroup/cpuacct/test/cpuacct.usage_sys; sleep 1; done
1976535645
2979839428
3979832704
4983603153
5983604157
Use the cpustats accounted in cpuacct_account_field() as the source of user/sys times for the cpuacct.usage* files. Make cpuacct_charge() account only summary execution time. Fixes: d740037fac70 ("sched/cpuacct: Split usage accounting into user_usage and sys_usage") Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20211115164607.23784-3-arbn@yandex-team.com
2021-11-23  cpuacct: Convert BUG_ON() to WARN_ON_ONCE()  (Andrey Ryabinin)
Replace fatal BUG_ON() with more safe WARN_ON_ONCE() in cpuacct_cpuusage_read(). Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20211115164607.23784-2-arbn@yandex-team.com
2021-11-23  cputime, cpuacct: Include guest time in user time in cpuacct.stat  (Andrey Ryabinin)
cpuacct.stat in non-root cgroups shows user time without guest time included in it. This doesn't match the user time shown in the root cpuacct.stat and /proc/<pid>/stat. This also affects cgroup2's cpu.stat in the same way. Make account_guest_time() add user time to the cgroup's cpustat to fix this. Fixes: ef12fefabf94 ("cpuacct: add per-cgroup utime/stime statistics") Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20211115164607.23784-1-arbn@yandex-team.com
2021-11-23  perf: Ignore sigtrap for tracepoints destined for other tasks  (Marco Elver)
syzbot reported that the warning in perf_sigtrap() fires, saying that the event's task does not match current: | WARNING: CPU: 0 PID: 9090 at kernel/events/core.c:6446 perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513 | Modules linked in: | CPU: 0 PID: 9090 Comm: syz-executor.1 Not tainted 5.15.0-syzkaller #0 | Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 | RIP: 0010:perf_sigtrap kernel/events/core.c:6446 [inline] | RIP: 0010:perf_pending_event_disable kernel/events/core.c:6470 [inline] | RIP: 0010:perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513 | ... | Call Trace: | <IRQ> | irq_work_single+0x106/0x220 kernel/irq_work.c:211 | irq_work_run_list+0x6a/0x90 kernel/irq_work.c:242 | irq_work_run+0x4f/0xd0 kernel/irq_work.c:251 | __sysvec_irq_work+0x95/0x3d0 arch/x86/kernel/irq_work.c:22 | sysvec_irq_work+0x8e/0xc0 arch/x86/kernel/irq_work.c:17 | </IRQ> | <TASK> | asm_sysvec_irq_work+0x12/0x20 arch/x86/include/asm/idtentry.h:664 | RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline] | RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70 kernel/locking/spinlock.c:194 | ... | coredump_task_exit kernel/exit.c:371 [inline] | do_exit+0x1865/0x25c0 kernel/exit.c:771 | do_group_exit+0xe7/0x290 kernel/exit.c:929 | get_signal+0x3b0/0x1ce0 kernel/signal.c:2820 | arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868 | handle_signal_work kernel/entry/common.c:148 [inline] | exit_to_user_mode_loop kernel/entry/common.c:172 [inline] | exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207 | __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] | syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300 | do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 | entry_SYSCALL_64_after_hwframe+0x44/0xae On x86 this shouldn't happen, which has arch_irq_work_raise(). The test program sets up a perf event with sigtrap set to fire on the 'sched_wakeup' tracepoint, which fired in ttwu_do_wakeup(). This happened because the 'sched_wakeup' tracepoint also takes a task argument passed on to perf_tp_event(), which is used to deliver the event to that other task. Since we cannot deliver synchronous signals to other tasks, skip an event if perf_tp_event() is targeted at another task and perf_event_attr::sigtrap is set, which will avoid ever entering perf_sigtrap() for such events. Fixes: 97ba62b27867 ("perf: Add support for SIGTRAP on perf events") Reported-by: syzbot+663359e32ce6f1a305ad@syzkaller.appspotmail.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YYpoCOBmC/kJWfmI@elver.google.com
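[Editor's note] A hedged sketch of the filtering idea; this is not the literal hunk in perf_tp_event(), just the predicate it effectively adds for events that request sigtrap delivery.
```c
/* Illustrative: a synchronous SIGTRAP can only be delivered to the task
 * that is currently executing, so skip sigtrap events that belong to a
 * different task than the one hitting the tracepoint. */
static bool sigtrap_deliverable(struct perf_event *event,
                                struct task_struct *target)
{
        return !(event->attr.sigtrap && target != current);
}
```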
2021-11-23  locking/rwsem: Optimize down_read_trylock() under highly contended case  (Muchun Song)
We found that a process with 10 thousand threads encountered a regression going from Linux-v4.14 to Linux-v5.4. It is a kind of workload which will sometimes concurrently allocate lots of memory in different threads. In this case, down_read_trylock() shows up as a hot spot. Therefore, we suppose that rwsem has had a regression at least since Linux-v5.4. In order to easily debug this problem, we wrote a simple benchmark to create a similar situation, like the following.
```c++
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sched.h>
#include <pthread.h>

#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <thread>
#include <vector>
#include <chrono>

volatile int mutex;

void trigger(int cpu, char* ptr, std::size_t sz)
{
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

        while (mutex);

        for (std::size_t i = 0; i < sz; i += 4096) {
                *ptr = '\0';
                ptr += 4096;
        }
}

int main(int argc, char* argv[])
{
        std::size_t sz = 100;

        if (argc > 1)
                sz = atoi(argv[1]);

        auto nproc = std::thread::hardware_concurrency();
        std::vector<std::thread> thr;
        sz <<= 30;
        auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE,
                         MAP_ANON | MAP_PRIVATE, -1, 0);
        assert(ptr != MAP_FAILED);
        char* cptr = static_cast<char*>(ptr);
        auto run = sz / nproc;
        run = (run >> 12) << 12;
        mutex = 1;

        for (auto i = 0U; i < nproc; ++i) {
                thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
                cptr += run;
        }

        rusage usage_start;
        getrusage(RUSAGE_SELF, &usage_start);
        auto start = std::chrono::system_clock::now();
        mutex = 0;

        for (auto& t : thr)
                t.join();

        rusage usage_end;
        getrusage(RUSAGE_SELF, &usage_end);
        auto end = std::chrono::system_clock::now();
        timeval utime;
        timeval stime;
        timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
        timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
        printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
        printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
        printf("real: %lu\n",
               std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count());

        return 0;
}
```
The program simply creates `nproc` threads, each of which touches memory (triggers page faults) on a different CPU. We then see a similar profile via `perf top`.
25.55% [kernel] [k] down_read_trylock
14.78% [kernel] [k] handle_mm_fault
13.45% [kernel] [k] up_read
 8.61% [kernel] [k] clear_page_erms
 3.89% [kernel] [k] __do_page_fault
The hottest instruction in down_read_trylock(), which accounts for about 92%, is cmpxchg, like the following.
91.89 │ lock cmpxchg %rdx,(%rdi)
Since the problem was found by migrating from Linux-v4.14 to Linux-v5.4, we easily found that commit ddb20d1d3aed ("locking/rwsem: Optimize down_read_trylock()") caused the regression. The reason is that the commit assumes the rwsem is not contended at all. But that is not always true for the mmap lock, which can be contended by thousands of threads. So most threads need to run at least 2 "cmpxchg" operations to acquire the lock. The overhead of an atomic operation is higher than that of non-atomic instructions, which caused the regression. Using the above benchmark, the real execution time on an x86-64 system before and after the patch was:
                   Before Patch    After Patch
   # of Threads        real            real        reduced by
   ------------       ------          ------       ----------
        1              65,373          65,206         ~0.0%
        4              15,467          15,378         ~0.5%
        40              6,214           5,528        ~11.0%
For the uncontended case, the new down_read_trylock() is the same as before.
For the contended cases, the new down_read_trylock() is faster than before. The more contended the lock, the greater the speedup. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com
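[Editor's note] A hedged sketch of the optimized trylock fast path; the macro names match the rwsem implementation of that era, but treat this as an approximation rather than the exact patched __down_read_trylock().
```c
/* Illustrative: read the count once up front, then let try_cmpxchg refresh
 * it on failure, so the uncontended case still costs a single cmpxchg while
 * the contended case avoids an extra atomic load per retry. */
static inline int read_trylock_sketch(struct rw_semaphore *sem)
{
        long cnt = atomic_long_read(&sem->count);

        do {
                if (cnt & RWSEM_READ_FAILED_MASK)
                        return 0;       /* writer/handoff present, give up */
        } while (!atomic_long_try_cmpxchg_acquire(&sem->count, &cnt,
                                                  cnt + RWSEM_READER_BIAS));
        return 1;
}
```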
2021-11-23  locking/rwsem: Make handoff bit handling more consistent  (Waiman Long)
There are some inconsistencies in the way that the handoff bit is handled in readers and writers that lead to a race condition. Firstly, when a queue head writer sets the handoff bit, it will clear it when the writer is killed or interrupted on its way out without acquiring the lock. That is not the case for a queue head reader: the handoff bit will simply be inherited by the next waiter. Secondly, in the out_nolock path of rwsem_down_read_slowpath(), both the waiter and handoff bits are cleared if the wait queue becomes empty. For rwsem_down_write_slowpath(), however, the handoff bit is not checked and cleared if the wait queue is empty. This can potentially leave the handoff bit set with an empty wait queue. Worse, the situation in rwsem_down_write_slowpath() relies on wstate, a variable set outside of the critical section containing the ->count manipulation; this leads to a race condition where RWSEM_FLAG_HANDOFF can be double subtracted, corrupting ->count. To make the handoff bit handling more consistent and robust, extract the handoff bit clearing code into the new rwsem_del_waiter() helper function. Also, completely eradicate wstate; always evaluate everything inside the same critical section. The common function will only use atomic_long_andnot() to clear bits when the wait queue is empty to avoid a possible race condition. If the first waiter with the handoff bit set is killed or interrupted and exits the slowpath without acquiring the lock, the next waiter will inherit the handoff bit. While at it, simplify the trylock for loop in rwsem_down_write_slowpath() to make it easier to read. Fixes: 4f23dbc1e657 ("locking/rwsem: Implement lock handoff to prevent lock starvation") Reported-by: Zhenhua Ma <mazhenhua@xiaomi.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211116012912.723980-1-longman@redhat.com
2021-11-22  lsm: security_task_getsecid_subj() -> security_current_getsecid_subj()  (Paul Moore)
The security_task_getsecid_subj() LSM hook invites misuse by allowing callers to specify a task even though the hook is only safe when the current task is referenced. Fix this by removing the task_struct argument to the hook, requiring LSM implementations to use the current task. While we are changing the hook declaration we also rename the function to security_current_getsecid_subj() in an effort to reinforce that the hook captures the subjective credentials of the current task and not an arbitrary task on the system. Reviewed-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
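[Editor's note] A hedged before/after sketch from a caller's point of view; the surrounding code is illustrative, only the two hook names come from the commit.
```c
u32 secid;

/* Before: the task argument invited passing something other than current. */
security_task_getsecid_subj(current, &secid);

/* After: the hook itself captures the current task's subjective credentials. */
security_current_getsecid_subj(&secid);
```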
2021-11-19  Merge tag 'trace-v5.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds)
Pull tracing fixes from Steven Rostedt: - Fix double free in destroy_hist_field - Harden memset() of trace_iterator structure - Do not warn in trace printk check when test buffer fills up * tag 'trace-v5.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Don't use out-of-sync va_list in event printing tracing: Use memset_startat() to zero struct trace_iterator tracing/histogram: Fix UAF in destroy_hist_field()
2021-11-19  Merge branch 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace  (Linus Torvalds)
Pull exit-vs-signal handling fixes from Eric Biederman: "This is a small set of changes where debuggers were no longer able to intercept synchronous SIGTRAP and SIGSEGV, introduced by the exit cleanups. This is essentially the change you suggested with all of the i's dotted and the t's crossed, so that ptrace can intercept all of the cases it has been able to intercept in the past, and all of the cases that made it to exit without giving ptrace a chance still don't give ptrace a chance" * 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signal: Replace force_fatal_sig with force_exit_sig when in doubt signal: Don't always set SA_IMMUTABLE for forced signals