summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2024-08-15smp: print only local CPU info when sched_clock goes backwardRik van Riel
About 40% of all csd_lock warnings observed in our fleet appear to be due to sched_clock() going backward in time (usually only a little bit), resulting in ts0 being larger than ts2. When the local CPU is at fault, we should print out a message reflecting that, rather than trying to get the remote CPU's stack trace. Signed-off-by: Rik van Riel <riel@surriel.com> Tested-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15locking/csd-lock: Use backoff for repeated reports of same incidentPaul E. McKenney
Currently, the CSD-lock diagnostics in CONFIG_CSD_LOCK_WAIT_DEBUG=y kernels are emitted at five-second intervals. Although this has proven to be a good time interval for the first diagnostic, if the target CPU keeps interrupts disabled for way longer than five seconds, the ratio of useful new information to pointless repetition increases considerably. Therefore, back off the time period for repeated reports of the same incident, increasing linearly with the number of reports and logarithmicly with the number of online CPUs. [ paulmck: Apply Dan Carpenter feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leonardo Bras <leobras@redhat.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Reviewed-by: Rik van Riel <riel@surriel.com> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15locking/csd_lock: Provide an indication of ongoing CSD-lock stallPaul E. McKenney
If a CSD-lock stall goes on long enough, it will cause an RCU CPU stall warning. This additional warning provides much additional console-log traffic and little additional information. Therefore, provide a new csd_lock_is_stuck() function that returns true if there is an ongoing CSD-lock stall. This function will be used by the RCU CPU stall warnings to provide a one-line indication of the stall when this function returns true. [ neeraj.upadhyay: Apply Rik van Riel feedback. ] [ neeraj.upadhyay: Apply kernel test robot feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leonardo Bras <leobras@redhat.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14Merge tag 'vfs-6.11-rc4.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "VFS: - Fix the name of file lease slab cache. When file leases were split out of file locks the name of the file lock slab cache was used for the file leases slab cache as well. - Fix a type in take_fd() helper. - Fix infinite directory iteration for stable offsets in tmpfs. - When the icache is pruned all reclaimable inodes are marked with I_FREEING and other processes that try to lookup such inodes will block. But some filesystems like ext4 can trigger lookups in their inode evict callback causing deadlocks. Ext4 does such lookups if the ea_inode feature is used whereby a separate inode may be used to store xattrs. Introduce I_LRU_ISOLATING which pins the inode while its pages are reclaimed. This avoids inode deletion during inode_lru_isolate() avoiding the deadlock and evict is made to wait until I_LRU_ISOLATING is done. netfs: - Fault in smaller chunks for non-large folio mappings for filesystems that haven't been converted to large folios yet. - Fix the CONFIG_NETFS_DEBUG config option. The config option was renamed a short while ago and that introduced two minor issues. First, it depended on CONFIG_NETFS whereas it wants to depend on CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter does. Second, the documentation for the config option wasn't fixed up. - Revert the removal of the PG_private_2 writeback flag as ceph is using it and fix how that flag is handled in netfs. - Fix DIO reads on 9p. A program watching a file on a 9p mount wouldn't see any changes in the size of the file being exported by the server if the file was changed directly in the source filesystem. Fix this by attempting to read the full size specified when a DIO read is requested. - Fix a NULL pointer dereference bug due to a data race where a cachefiles cookies was retired even though it was still in use. Check the cookie's n_accesses counter before discarding it. nsfs: - Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as the kernel is writing to userspace. pidfs: - Prevent the creation of pidfds for kthreads until we have a use-case for it and we know the semantics we want. It also confuses userspace why they can get pidfds for kthreads. squashfs: - Fix an unitialized value bug reported by KMSAN caused by a corrupted symbolic link size read from disk. Check that the symbolic link size is not larger than expected" * tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: Squashfs: sanity check symbolic link size 9p: Fix DIO read through netfs vfs: Don't evict inode under the inode lru traversing context netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag" file: fix typo in take_fd() comment pidfd: prevent creation of pidfds for kthreads netfs: clean up after renaming FSCACHE_DEBUG config libfs: fix infinite directory reads for offset dir nsfs: fix ioctl declaration fs/netfs/fscache_cookie: add missing "n_accesses" check filelock: fix name of file_lease slab cache netfs: Fault in smaller chunks for non-large folio mappings
2024-08-14rcuscale: Print detailed grace-period and barrier diagnosticsPaul E. McKenney
This commit uses the new rcu_tasks_torture_stats_print(), rcu_tasks_trace_torture_stats_print(), and rcu_tasks_rude_torture_stats_print() functions in order to provide detailed diagnostics on grace-period, callback, and barrier state when rcu_scale_writer() hangs. [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcu: Mark callbacks not currently participating in barrier operationPaul E. McKenney
RCU keeps a count of the number of callbacks that the current rcu_barrier() is waiting on, but there is currently no easy way to work out which callback is stuck. One way to do this is to mark idle RCU-barrier callbacks by making the ->next pointer point to the callback itself, and this commit does just that. Later commits will use this for debug output. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcuscale: Dump grace-period statistics when rcu_scale_writer() stallsPaul E. McKenney
This commit adds a .stats function pointer to the rcu_scale_ops structure, and if this is non-NULL, it is invoked after stack traces are dumped in response to a rcu_scale_writer() stall. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcuscale: Dump stacks of stalled rcu_scale_writer() instancesPaul E. McKenney
This commit improves debuggability by dumping the stacks of rcu_scale_writer() instances that have not completed in a reasonable timeframe. These stacks are dumped remotely, but they will be accurate in the thus-far common case where the stalled rcu_scale_writer() instances are blocked. [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcuscale: Save a few lines with whitespace-only changePaul E. McKenney
This whitespace-only commit fuses a few lines of code, taking advantage of the newish 100-character-per-line limit to save a few lines of code. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14refscale: Optimize process_durations()Christophe JAILLET
process_durations() is not a hot path, but there is no good reason to iterate over and over the data already in 'buf'. Using a seq_buf saves some useless strcat() and the need of a temp buffer. Data is written directly at the correct place. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Tested-by: "Paul E. McKenney" <paulmck@kernel.org> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcu/tasks: Add rcu_barrier_tasks*() start time to diagnosticsPaul E. McKenney
This commit adds the start time, in jiffies, of the most recently started rcu_barrier_tasks*() operation to the diagnostic output used by rcuscale. This information can be helpful in distinguishing a hung barrier operation from a long series of barrier operations. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcu/tasks: Add detailed grace-period and barrier diagnosticsPaul E. McKenney
This commit adds rcu_tasks_torture_stats_print(), rcu_tasks_trace_torture_stats_print(), and rcu_tasks_rude_torture_stats_print() functions that provide detailed diagnostics on grace-period, callback, and barrier state. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcu/tasks: Mark callbacks not currently participating in barrier operationPaul E. McKenney
Each Tasks RCU flavor keeps a count of the number of callbacks that the current rcu_barrier_tasks*() is waiting on, but there is currently no easy way to work out which callback is stuck. One way to do this is to mark idle RCU-barrier callbacks by making the ->next pointer point to the callback itself, and this commit does just that. Later commits will use this for debug output. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcu: Provide rcu_barrier_cb_is_done() to check rcu_barrier() CBsPaul E. McKenney
This commit provides a rcu_barrier_cb_is_done() function that returns true if the *rcu_barrier*() callback passed in is done. This will be used when printing grace-period debugging information. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcu/tasks: Update rtp->tasks_gp_seq commentPaul E. McKenney
The rtp->tasks_gp_seq grace-period sequence number is not a strict count, but rather the usual RCU sequence number with the lower few bits tracking per-grace-period state and the upper bits the count of grace periods since boot, give or take the initial value. This commit therefore adjusts this comment. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcu/tasks: Check processor-ID assumptionsPaul E. McKenney
The current mapping of smp_processor_id() to a CPU processing Tasks-RCU callbacks makes some assumptions about layout. This commit therefore adds a WARN_ON() to check these assumptions. [ neeraj.upadhyay: Replace nr_cpu_ids with rcu_task_cpu_ids. ] Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcu-tasks: Fix access non-existent percpu rtpcp variable in ↵Zqiang
rcu_tasks_need_gpcb() For kernels built with CONFIG_FORCE_NR_CPUS=y, the nr_cpu_ids is defined as NR_CPUS instead of the number of possible cpus, this will cause the following system panic: smpboot: Allowing 4 CPUs, 0 hotplug CPUs ... setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:512 nr_node_ids:1 ... BUG: unable to handle page fault for address: ffffffff9911c8c8 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 15 Comm: rcu_tasks_trace Tainted: G W 6.6.21 #1 5dc7acf91a5e8e9ac9dcfc35bee0245691283ea6 RIP: 0010:rcu_tasks_need_gpcb+0x25d/0x2c0 RSP: 0018:ffffa371c00a3e60 EFLAGS: 00010082 CR2: ffffffff9911c8c8 CR3: 000000040fa20005 CR4: 00000000001706f0 Call Trace: <TASK> ? __die+0x23/0x80 ? page_fault_oops+0xa4/0x180 ? exc_page_fault+0x152/0x180 ? asm_exc_page_fault+0x26/0x40 ? rcu_tasks_need_gpcb+0x25d/0x2c0 ? __pfx_rcu_tasks_kthread+0x40/0x40 rcu_tasks_one_gp+0x69/0x180 rcu_tasks_kthread+0x94/0xc0 kthread+0xe8/0x140 ? __pfx_kthread+0x40/0x40 ret_from_fork+0x34/0x80 ? __pfx_kthread+0x40/0x40 ret_from_fork_asm+0x1b/0x80 </TASK> Considering that there may be holes in the CPU numbers, use the maximum possible cpu number, instead of nr_cpu_ids, for configuring enqueue and dequeue limits. [ neeraj.upadhyay: Fix htmldocs build error reported by Stephen Rothwell ] Closes: https://lore.kernel.org/linux-input/CALMA0xaTSMN+p4xUXkzrtR5r6k7hgoswcaXx7baR_z9r5jjskw@mail.gmail.com/T/#u Reported-by: Zhixu Liu <zhixu.liu@gmail.com> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcu-tasks: Remove RCU Tasks Rude asynchronous APIsPaul E. McKenney
The call_rcu_tasks_rude() and rcu_barrier_tasks_rude() APIs are currently unused. This commit therefore removes their definitions and boot-time self-tests. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcuscale: Stop testing RCU Tasks Rude asynchronous APIsPaul E. McKenney
The call_rcu_tasks_rude() and rcu_barrier_tasks_rude() APIs are currently unused. Furthermore, the idea is to get rid of RCU Tasks Rude entirely once all architectures have their deep-idle and entry/exit code correctly marked as inline or noinstr. As a step towards this goal, this commit therefore removes these two functions from rcuscale testing. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcutorture: Stop testing RCU Tasks Rude asynchronous APIsPaul E. McKenney
The call_rcu_tasks_rude() and rcu_barrier_tasks_rude() APIs are currently unused. Furthermore, the idea is to get rid of RCU Tasks Rude entirely once all architectures have their deep-idle and entry/exit code correctly marked as inline or noinstr. As a first step towards this goal, this commit therefore removes these two functions from rcutorture testing. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14rcutorture: Add a stall_cpu_repeat module parameterPaul E. McKenney
This commit adds an stall_cpu_repeat kernel, which is also the rcutorture.stall_cpu_repeat boot parameter, to test repeated CPU stalls. Note that only the first stall will pay attention to the stall_cpu_irqsoff module parameter. For the second and subsequent stalls, interrupts will be enabled. This is helpful when testing the interaction between RCU CPU stall warnings and CSD-lock stall warnings. Reported-by: Rik van Riel <riel@surriel.com> Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.Sebastian Andrzej Siewior
The two hrtimer_cpu_base_.*_expiry() functions are wrappers around the locking functions and sparse complains about the missing counterpart. Add sparse annotation to denote that this bevaviour is expected. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240812105326.2240000-3-bigeasy@linutronix.de
2024-08-14timers: Add sparse annotation for timer_sync_wait_running().Sebastian Andrzej Siewior
timer_sync_wait_running() first releases two locks and then acquires them again. This is unexpected and sparse complains about it. Add sparse annotation for timer_sync_wait_running() to note that the locking is expected. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240812105326.2240000-2-bigeasy@linutronix.de
2024-08-14perf: Really fix event_function_call() lockingNamhyung Kim
Commit 558abc7e3f89 ("perf: Fix event_function_call() locking") lost IRQ disabling by mistake. Fixes: 558abc7e3f89 ("perf: Fix event_function_call() locking") Reported-by: Pengfei Xu <pengfei.xu@intel.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-08-13sched_ext: Don't use double locking to migrate tasks across CPUsTejun Heo
consume_remote_task() and dispatch_to_local_dsq() use move_task_to_local_dsq() to migrate the task to the target CPU. Currently, move_task_to_local_dsq() expects the caller to lock both the source and destination rq's. While this may save a few lock operations while the rq's are not contended, under contention, the double locking can exacerbate the situation significantly (refer to the linked message below). Update the migration path so that double locking is not used. move_task_to_local_dsq() now expects the caller to be locking the source rq, drops it and then acquires the destination rq lock. Code is simpler this way and, on a 2-way NUMA machine w/ Xeon Gold 6138, 'hackbench 100 thread 5000` shows ~3% improvement with scx_simple. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20240806082716.GP37996@noisy.programming.kicks-ass.net Acked-by: David Vernet <void@manifault.com>
2024-08-13workqueue: Add interface for user-defined workqueue lockdep mapMatthew Brost
Add an interface for a user-defined workqueue lockdep map, which is helpful when multiple workqueues are created for the same purpose. This also helps avoid leaking lockdep maps on each workqueue creation. v2: - Add alloc_workqueue_lockdep_map (Tejun) v3: - Drop __WQ_USER_OWNED_LOCKDEP (Tejun) - static inline alloc_ordered_workqueue_lockdep_map (Tejun) Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-13workqueue: Change workqueue lockdep map to pointerMatthew Brost
Will help enable user-defined lockdep maps for workqueues. Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-13workqueue: Split alloc_workqueue into internal function and lockdep initMatthew Brost
Will help enable user-defined lockdep maps for workqueues. Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-13sched_ext: define missing cfi stubs for sched_extManu Bretelle
`__bpf_ops_sched_ext_ops` was missing the initialization of some struct attributes. With https://lore.kernel.org/all/20240722183049.2254692-4-martin.lau@linux.dev/ every single attributes need to be initialized programs (like scx_layered) will fail to load. 05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_init not found in kernel, skipping it as it's set to zero 05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_exit not found in kernel, skipping it as it's set to zero 05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_prep_move not found in kernel, skipping it as it's set to zero 05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_move not found in kernel, skipping it as it's set to zero 05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_cancel_move not found in kernel, skipping it as it's set to zero 05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_set_weight not found in kernel, skipping it as it's set to zero 05:26:48 [WARN] libbpf: prog 'layered_dump': BPF program load failed: unknown error (-524) 05:26:48 [WARN] libbpf: prog 'layered_dump': -- BEGIN PROG LOAD LOG -- attach to unsupported member dump of struct sched_ext_ops processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 -- END PROG LOAD LOG -- 05:26:48 [WARN] libbpf: prog 'layered_dump': failed to load: -524 05:26:48 [WARN] libbpf: failed to load object 'bpf_bpf' 05:26:48 [WARN] libbpf: failed to load BPF skeleton 'bpf_bpf': -524 Error: Failed to load BPF program Signed-off-by: Manu Bretelle <chantr4@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-13perf/bpf: Don't call bpf_overflow_handler() for tracing eventsKyle Huey
The regressing commit is new in 6.10. It assumed that anytime event->prog is set bpf_overflow_handler() should be invoked to execute the attached bpf program. This assumption is false for tracing events, and as a result the regressing commit broke bpftrace by invoking the bpf handler with garbage inputs on overflow. Prior to the regression the overflow handlers formed a chain (of length 0, 1, or 2) and perf_event_set_bpf_handler() (the !tracing case) added bpf_overflow_handler() to that chain, while perf_event_attach_bpf_prog() (the tracing case) did not. Both set event->prog. The chain of overflow handlers was replaced by a single overflow handler slot and a fixed call to bpf_overflow_handler() when appropriate. This modifies the condition there to check event->prog->type == BPF_PROG_TYPE_PERF_EVENT, restoring the previous behavior and fixing bpftrace. Signed-off-by: Kyle Huey <khuey@kylehuey.com> Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Reported-by: Joe Damato <jdamato@fastly.com> Closes: https://lore.kernel.org/lkml/ZpFfocvyF3KHaSzF@LQ3V64L9R2/ Fixes: f11f10bfa1ca ("perf/bpf: Call BPF handler directly, not through overflow machinery") Cc: stable@vger.kernel.org Tested-by: Joe Damato <jdamato@fastly.com> # bpftrace Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240813151727.28797-1-jdamato@fastly.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-13printk/panic: Allow cpu backtraces to be written into ringbuffer during panicRyo Takakura
commit 779dbc2e78d7 ("printk: Avoid non-panic CPUs writing to ringbuffer") disabled non-panic CPUs to further write messages to ringbuffer after panicked. Since the commit, non-panicked CPU's are not allowed to write to ring buffer after panicked and CPU backtrace which is triggered after panicked to sample non-panicked CPUs' backtrace no longer serves its function as it has nothing to print. Fix the issue by allowing non-panicked CPUs to write into ringbuffer while CPU backtrace is in flight. Fixes: 779dbc2e78d7 ("printk: Avoid non-panic CPUs writing to ringbuffer") Signed-off-by: Ryo Takakura <takakura@valinux.co.jp> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240812072703.339690-1-takakura@valinux.co.jp Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-13irqdomain: Remove stray '-' in the domain nameAndy Shevchenko
When the domain suffix is not supplied alloc_fwnode_name() unconditionally adds a separator. Fix the format strings to get rid of the stray '-' separator. Fixes: 1e7c05292531 ("irqdomain: Allow giving name suffix for domain") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240812193101.1266625-3-andriy.shevchenko@linux.intel.com
2024-08-13irqdomain: Clarify checks for bus_tokenAndy Shevchenko
The code uses if (bus_token) and if (bus_token == DOMAIN_BUS_ANY). Since bus_token is an enum, the latter is more robust against changes. Convert all !bus_token checks to explicitely check for DOMAIN_BUS_ANY. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240812193101.1266625-2-andriy.shevchenko@linux.intel.com
2024-08-12struct fd: representation changeAl Viro
We want the compiler to see that fdput() on empty instance is a no-op. The emptiness check is that file reference is NULL, while fdput() is "fput() if FDPUT_FPUT is present in flags". The reason why fdput() on empty instance is a no-op is something compiler can't see - it's that we never generate instances with NULL file reference combined with non-zero flags. It's not that hard to deal with - the real primitives behind fdget() et.al. are returning an unsigned long value, unpacked by (inlined) __to_fd() into the current struct file * + int. The lower bits are used to store flags, while the rest encodes the pointer. Linus suggested that keeping this unsigned long around with the extractions done by inlined accessors should generate a sane code and that turns out to be the case. Namely, turning struct fd into a struct-wrapped unsinged long, with fd_empty(f) => unlikely(f.word == 0) fd_file(f) => (struct file *)(f.word & ~3) fdput(f) => if (f.word & 1) fput(fd_file(f)) ends up with compiler doing the right thing. The cost is the patch footprint, of course - we need to switch f.file to fd_file(f) all over the tree, and it's not doable with simple search and replace; there are false positives, etc. Note that the sole member of that structure is an opaque unsigned long - all accesses should be done via wrappers and I don't want to use a name that would invite manual casts to file pointers, etc. The value of that member is equal either to (unsigned long)p | flags, p being an address of some struct file instance, or to 0 for an empty fd. For now the new predicate (fd_empty(f)) has no users; all the existing checks have form (!fd_file(f)). We will convert to fd_empty() use later; here we only define it (and tell the compiler that it's unlikely to return true). This commit only deals with representation change; there will be followups. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-12introduce fd_file(), convert all accessors to it.Al Viro
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-12bpf: Fix a kernel verifier crash in stacksafe()Yonghong Song
Daniel Hodges reported a kernel verifier crash when playing with sched-ext. Further investigation shows that the crash is due to invalid memory access in stacksafe(). More specifically, it is the following code: if (exact != NOT_EXACT && old->stack[spi].slot_type[i % BPF_REG_SIZE] != cur->stack[spi].slot_type[i % BPF_REG_SIZE]) return false; The 'i' iterates old->allocated_stack. If cur->allocated_stack < old->allocated_stack the out-of-bound access will happen. To fix the issue add 'i >= cur->allocated_stack' check such that if the condition is true, stacksafe() should fail. Otherwise, cur->stack[spi].slot_type[i % BPF_REG_SIZE] memory access is legal. Fixes: 2793a8b015f7 ("bpf: exact states comparison for iterator convergence checks") Cc: Eduard Zingerman <eddyz87@gmail.com> Reported-by: Daniel Hodges <hodgesd@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240812214847.213612-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-13cpu/SMT: Enable SMT only if a core is onlineNysal Jan K.A
If a core is offline then enabling SMT should not online CPUs of this core. By enabling SMT, what is intended is either changing the SMT value from "off" to "on" or setting the SMT level (threads per core) from a lower to higher value. On PowerPC the ppc64_cpu utility can be used, among other things, to perform the following functions: ppc64_cpu --cores-on # Get the number of online cores ppc64_cpu --cores-on=X # Put exactly X cores online ppc64_cpu --offline-cores=X[,Y,...] # Put specified cores offline ppc64_cpu --smt={on|off|value} # Enable, disable or change SMT level If the user has decided to offline certain cores, enabling SMT should not online CPUs in those cores. This patch fixes the issue and changes the behaviour as described, by introducing an arch specific function topology_is_core_online(). It is currently implemented only for PowerPC. Fixes: 73c58e7e1412 ("powerpc: Add HOTPLUG_SMT support") Reported-by: Tyrel Datwyler <tyreld@linux.ibm.com> Closes: https://groups.google.com/g/powerpc-utils-devel/c/wrwVzAAnRlI/m/5KJSoqP4BAAJ Signed-off-by: Nysal Jan K.A <nysal@linux.ibm.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240731030126.956210-2-nysal@linux.ibm.com
2024-08-12pidfd: prevent creation of pidfds for kthreadsChristian Brauner
It's currently possible to create pidfds for kthreads but it is unclear what that is supposed to mean. Until we have use-cases for it and we figured out what behavior we want block the creation of pidfds for kthreads. Link: https://lore.kernel.org/r/20240731-gleis-mehreinnahmen-6bbadd128383@brauner Fixes: 32fcb426ec00 ("pid: add pidfd_open()") Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12srcu: Mark callbacks not currently participating in barrier operationPaul E. McKenney
SRCU keeps a count of the number of callbacks that the current srcu_barrier() is waiting on, but there is currently no easy way to work out which callback is stuck. One way to do this is to mark idle SRCU-barrier callbacks by making the ->next pointer point to the callback itself, and this commit does just that. Later commits will use this for debug output. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-12srcu: Check for concurrent updates of heuristicsPaul E. McKenney
SRCU maintains the ->srcu_n_exp_nodelay and ->reschedule_count values to guide heuristics governing auto-expediting of normal SRCU grace periods and grace-period-state-machine delays. This commit adds KCSAN ASSERT_EXCLUSIVE_WRITER() calls to check for concurrent updates to these fields. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-12srcu: faster gp seq wrap-aroundJP Kobryn
Using a higher value for the initial gp sequence counters allows for wrapping to occur faster. It can help with surfacing any issues that may be happening as a result of the wrap around. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-12rcu: Use system_unbound_wq to avoid disturbing isolated CPUsWaiman Long
It was discovered that isolated CPUs could sometimes be disturbed by kworkers processing kfree_rcu() works causing higher than expected latency. It is because the RCU core uses "system_wq" which doesn't have the WQ_UNBOUND flag to handle all its work items. Fix this violation of latency limits by using "system_unbound_wq" in the RCU core instead. This will ensure that those work items will not be run on CPUs marked as isolated. Beside the WQ_UNBOUND flag, the other major difference between system_wq and system_unbound_wq is their max_active count. The system_unbound_wq has a max_active of WQ_MAX_ACTIVE (512) while system_wq's max_active is WQ_DFL_ACTIVE (256) which is half of WQ_MAX_ACTIVE. Reported-by: Vratislav Bendel <vbendel@redhat.com> Closes: https://issues.redhat.com/browse/RHEL-50220 Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Tested-by: Breno Leitao <leitao@debian.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-11Merge tag 'timers-urgent-2024-08-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull time keeping fixes from Thomas Gleixner: - Fix a couple of issues in the NTP code where user supplied values are neither sanity checked nor clamped to the operating range. This results in integer overflows and eventualy NTP getting out of sync. According to the history the sanity checks had been removed in favor of clamping the values, but the clamping never worked correctly under all circumstances. The NTP people asked to not bring the sanity checks back as it might break existing applications. Make the clamping work correctly and add it where it's missing - If adjtimex() sets the clock it has to trigger the hrtimer subsystem so it can adjust and if the clock was set into the future expire timers if needed. The caller should provide a bitmask to tell hrtimers which clocks have been adjusted. adjtimex() uses not the proper constant and uses CLOCK_REALTIME instead, which is 0. So hrtimers adjusts only the clocks, but does not check for expired timers, which might make them expire really late. Use the proper bitmask constant instead. * tag 'timers-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Fix bogus clock_was_set() invocation in do_adjtimex() ntp: Safeguard against time_constant overflow ntp: Clamp maxerror and esterror to operating range
2024-08-11Merge tag 'irq-urgent-2024-08-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "Three small fixes for interrupt core and drivers: - The interrupt core fails to honor caller supplied affinity hints for non-managed interrupts and uses the system default affinity on startup instead. Set the missing flag in the descriptor to tell the core to use the provided affinity. - Fix a shift out of bounds error in the Xilinx driver - Handle switching to level trigger correctly in the RISCV APLIC driver. It failed to retrigger the interrupt which causes it to become stale" * tag 'irq-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/riscv-aplic: Retrigger MSI interrupt on source configuration irqchip/xilinx: Fix shift out of bounds genirq/irqdesc: Honor caller provided affinity in alloc_desc()
2024-08-11rcu: Annotate struct kvfree_rcu_bulk_data with __counted_by()Thorsten Blum
Add the __counted_by compiler attribute to the flexible array member records to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Increment nr_records before adding a new pointer to the records array. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org> Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-11context_tracking, rcu: Rename DYNTICK_IRQ_NONIDLE into CT_NESTING_IRQ_NONIDLEValentin Schneider
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any meaning. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-11context_tracking, rcu: Rename ct_dynticks_nmi_nesting_cpu() into ↵Valentin Schneider
ct_nmi_nesting_cpu() The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any meaning. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-11context_tracking, rcu: Rename ct_dynticks_nmi_nesting() into ct_nmi_nesting()Valentin Schneider
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any meaning. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-11context_tracking, rcu: Rename struct context_tracking .dynticks_nmi_nesting ↵Valentin Schneider
into .nmi_nesting The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any meaning. [ neeraj.upadhyay: Fix htmldocs build error reported by Stephen Rothwell ] Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-11context_tracking, rcu: Rename ct_dynticks_nesting_cpu() into ct_nesting_cpu()Valentin Schneider
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any meaning. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>