summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2023-12-29ring-buffer: Fix wake ups when buffer_percent is set to 100Steven Rostedt (Google)
The tracefs file "buffer_percent" is to allow user space to set a water-mark on how much of the tracing ring buffer needs to be filled in order to wake up a blocked reader. 0 - is to wait until any data is in the buffer 1 - is to wait for 1% of the sub buffers to be filled 50 - would be half of the sub buffers are filled with data 100 - is not to wake the waiter until the ring buffer is completely full Unfortunately the test for being full was: dirty = ring_buffer_nr_dirty_pages(buffer, cpu); return (dirty * 100) > (full * nr_pages); Where "full" is the value for "buffer_percent". There is two issues with the above when full == 100. 1. dirty * 100 > 100 * nr_pages will never be true That is, the above is basically saying that if the user sets buffer_percent to 100, more pages need to be dirty than exist in the ring buffer! 2. The page that the writer is on is never considered dirty, as dirty pages are only those that are full. When the writer goes to a new sub-buffer, it clears the contents of that sub-buffer. That is, even if the check was ">=" it would still not be equal as the most pages that can be considered "dirty" is nr_pages - 1. To fix this, add one to dirty and use ">=" in the compare. Link: https://lore.kernel.org/linux-trace-kernel/20231226125902.4a057f1d@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: 03329f9939781 ("tracing: Add tracefs file buffer_percentage") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-29sched/fair: Fix tg->load when offlining a CPUVincent Guittot
When a CPU is taken offline, the contribution of its cfs_rqs to task_groups' load may remain and will negatively impact the calculation of the share of the online CPUs. To fix this bug, clear the contribution of an offlining CPU to task groups' load and skip its contribution while it is inactive. Here's the reproducer of the anomaly, by Imran Khan: "So far I have encountered only one rather lengthy way of reproducing this issue, which is as follows: 1. Take a KVM guest (booted with 4 CPUs and can be scaled up to 124 CPUs) and create 2 custom cgroups: /sys/fs/cgroup/cpu/test_group_1 and /sys/fs/cgroup/ cpu/test_group_2 2. Assign a CPU intensive workload to each of these cgroups and start the workload. For my tests I am using following app: int main(int argc, char *argv[]) { unsigned long count, i, val; if (argc != 2) { printf("usage: ./a.out <number of random nums to generate> \n"); return 0; } count = strtoul(argv[1], NULL, 10); printf("Generating %lu random numbers \n", count); for (i = 0; i < count; i++) { val = rand(); val = val % 2; //usleep(1); } printf("Generated %lu random numbers \n", count); return 0; } Also since the system is booted with 4 CPUs, in order to completely load the system I am also launching 4 instances of same test app under: /sys/fs/cgroup/cpu/ 3. We can see that both of the cgroups get similar CPU time: # systemd-cgtop --depth 1 Path Tasks %CPU Memory Input/s Output/s / 659 - 5.5G - - /system.slice - - 5.7G - - /test_group_1 4 - - - - /test_group_2 3 - - - - /user.slice 31 - 56.5M - - Path Tasks %CPU Memory Input/s Output/s / 659 394.6 5.5G - - /test_group_2 3 65.7 - - - /user.slice 29 55.1 48.0M - - /test_group_1 4 47.3 - - - /system.slice - 2.2 5.7G - - Path Tasks %CPU Memory Input/s Output/s / 659 394.8 5.5G - - /test_group_1 4 62.9 - - - /user.slice 28 44.9 54.2M - - /test_group_2 3 44.7 - - - /system.slice - 0.9 5.7G - - Path Tasks %CPU Memory Input/s Output/s / 659 394.4 5.5G - - /test_group_2 3 58.8 - - - /test_group_1 4 51.9 - - - /user.slice 30 39.3 59.6M - - /system.slice - 1.9 5.7G - - Path Tasks %CPU Memory Input/s Output/s / 659 394.7 5.5G - - /test_group_1 4 60.9 - - - /test_group_2 3 57.9 - - - /user.slice 28 43.5 36.9M - - /system.slice - 3.0 5.7G - - Path Tasks %CPU Memory Input/s Output/s / 659 395.0 5.5G - - /test_group_1 4 66.8 - - - /test_group_2 3 56.3 - - - /user.slice 29 43.1 51.8M - - /system.slice - 0.7 5.7G - - 4. Now move systemd-udevd to one of these test groups, say test_group_1, and perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the host side. 5. Run the same workload i.e 4 instances of CPU hogger under /sys/fs/cgroup/cpu and one instance of CPU hogger each in /sys/fs/cgroup/cpu/test_group_1 and /sys/fs/cgroup/test_group_2. It can be seen that test_group_1 (the one where systemd-udevd was moved) is getting much less CPU time than the test_group_2, even though at this point of time both of these groups have only CPU hogger running: # systemd-cgtop --depth 1 Path Tasks %CPU Memory Input/s Output/s / 1219 - 5.4G - - /system.slice - - 5.6G - - /test_group_1 4 - - - - /test_group_2 3 - - - - /user.slice 26 - 91.3M - - Path Tasks %CPU Memory Input/s Output/s / 1221 394.3 5.4G - - /test_group_2 3 82.7 - - - /test_group_1 4 14.3 - - - /system.slice - 0.8 5.6G - - /user.slice 26 0.4 91.2M - - Path Tasks %CPU Memory Input/s Output/s / 1221 394.6 5.4G - - /test_group_2 3 67.4 - - - /system.slice - 24.6 5.6G - - /test_group_1 4 12.5 - - - /user.slice 26 0.4 91.2M - - Path Tasks %CPU Memory Input/s Output/s / 1221 395.2 5.4G - - /test_group_2 3 60.9 - - - /system.slice - 27.9 5.6G - - /test_group_1 4 12.2 - - - /user.slice 26 0.4 91.2M - - Path Tasks %CPU Memory Input/s Output/s / 1221 395.2 5.4G - - /test_group_2 3 69.4 - - - /test_group_1 4 13.9 - - - /user.slice 28 1.6 92.0M - - /system.slice - 1.0 5.6G - - Path Tasks %CPU Memory Input/s Output/s / 1221 395.6 5.4G - - /test_group_2 3 59.3 - - - /test_group_1 4 14.1 - - - /user.slice 28 1.3 92.2M - - /system.slice - 0.7 5.6G - - Path Tasks %CPU Memory Input/s Output/s / 1221 395.5 5.4G - - /test_group_2 3 67.2 - - - /test_group_1 4 11.5 - - - /user.slice 28 1.3 92.5M - - /system.slice - 0.6 5.6G - - Path Tasks %CPU Memory Input/s Output/s / 1221 395.1 5.4G - - /test_group_2 3 76.8 - - - /test_group_1 4 12.9 - - - /user.slice 28 1.3 92.8M - - /system.slice - 1.2 5.6G - - From sched_debug data it can be seen that in bad case the load.weight of per-CPU sched entities corresponding to test_group_1 has reduced significantly and also load_avg of test_group_1 remains much higher than that of test_group_2, even though systemd-udevd stopped running long time back and at this point of time both cgroups just have the CPU hogger app as running entity." [ mingo: Added details from the original discussion, plus minor edits to the patch. ] Reported-by: Imran Khan <imran.f.khan@oracle.com> Tested-by: Imran Khan <imran.f.khan@oracle.com> Tested-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Imran Khan <imran.f.khan@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Link: https://lore.kernel.org/r/20231223111545.62135-1-vincent.guittot@linaro.org
2023-12-27Merge tag 'mm-hotfixes-stable-2023-12-27-15-00' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "11 hotfixes. 7 are cc:stable and the other 4 address post-6.6 issues or are not considered backporting material" * tag 'mm-hotfixes-stable-2023-12-27-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: add an old address for Naoya Horiguchi mm/memory-failure: cast index to loff_t before shifting it mm/memory-failure: check the mapcount of the precise page mm/memory-failure: pass the folio and the page to collect_procs() selftests: secretmem: floor the memory size to the multiple of page_size mm: migrate high-order folios in swap cache correctly maple_tree: do not preallocate nodes for slot stores mm/filemap: avoid buffered read/write race to read inconsistent data kunit: kasan_test: disable fortify string checker on kmalloc_oob_memset kexec: select CRYPTO from KEXEC_FILE instead of depending on it kexec: fix KEXEC_FILE dependencies
2023-12-27Kill sched.h dependency on rcupdate.hKent Overstreet
by moving cond_resched_rcu() to rcupdate_wait.h, we can kill another big sched.h dependency. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-27rseq: Split out rseq.h from sched.hKent Overstreet
We're trying to get sched.h down to more or less just types only, not code - rseq can live in its own header. This helps us kill the dependency on preempt.h in sched.h. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-23sched/fair: Remove unused 'next_buddy_marked' local variable in ↵Wang Jinchao
check_preempt_wakeup_fair() This variable became unused in: 5e963f2bd465 ("sched/fair: Commit to EEVDF") Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/202312141319+0800-wangjinchao@xfusion.com
2023-12-23sched/fair: Use all little CPUs for CPU-bound workloadsPierre Gondois
Running N CPU-bound tasks on an N CPUs platform: - with asymmetric CPU capacity - not being a DynamIq system (i.e. having a PKG level sched domain without the SD_SHARE_PKG_RESOURCES flag set) .. might result in a task placement where two tasks run on a big CPU and none on a little CPU. This placement could be more optimal by using all CPUs. Testing platform: Juno-r2: - 2 big CPUs (1-2), maximum capacity of 1024 - 4 little CPUs (0,3-5), maximum capacity of 383 Testing workload ([1]): Spawn 6 CPU-bound tasks. During the first 100ms (step 1), each tasks is affine to a CPU, except for: - one little CPU which is left idle. - one big CPU which has 2 tasks affine. After the 100ms (step 2), remove the cpumask affinity. Behavior before the patch: During step 2, the load balancer running from the idle CPU tags sched domains as: - little CPUs: 'group_has_spare'. Cf. group_has_capacity() and group_is_overloaded(), 3 CPU-bound tasks run on a 4 CPUs sched-domain, and the idle CPU provides enough spare capacity regarding the imbalance_pct - big CPUs: 'group_overloaded'. Indeed, 3 tasks run on a 2 CPUs sched-domain, so the following path is used: group_is_overloaded() \-if (sgs->sum_nr_running <= sgs->group_weight) return true; The following path which would change the migration type to 'migrate_task' is not taken: calculate_imbalance() \-if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) as the local group has some spare capacity, so the imbalance is not 0. The migration type requested is 'migrate_util' and the busiest runqueue is the big CPU's runqueue having 2 tasks (each having a utilization of 512). The idle little CPU cannot pull one of these task as its capacity is too small for the task. The following path is used: detach_tasks() \-case migrate_util: \-if (util > env->imbalance) goto next; After the patch: As the number of failed balancing attempts grows (with 'nr_balance_failed'), progressively make it easier to migrate a big task to the idling little CPU. A similar mechanism is used for the 'migrate_load' migration type. Improvement: Running the testing workload [1] with the step 2 representing a ~10s load for a big CPU: Before patch: ~19.3s After patch: ~18s (-6.7%) Similar issue reported at: https://lore.kernel.org/lkml/20230716014125.139577-1-qyousef@layalina.io/ Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Pierre Gondois <pierre.gondois@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Qais Yousef <qyousef@layalina.io> Link: https://lore.kernel.org/r/20231206090043.634697-1-pierre.gondois@arm.com
2023-12-23sched/fair: Simplify util_estVincent Guittot
With UTIL_EST_FASTUP now being permanent, we can take advantage of the fact that the ewma jumps directly to a higher utilization at dequeue to simplify util_est and remove the enqueued field. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com> Reviewed-by: Alex Shi <alexs@kernel.org> Link: https://lore.kernel.org/r/20231201161652.1241695-3-vincent.guittot@linaro.org
2023-12-23sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)Vincent Guittot
sched_feat(UTIL_EST_FASTUP) has been added to easily disable the feature in order to check for possibly related regressions. After 3 years, it has never been used and no regression has been reported. Let's remove it and make fast increase a permanent behavior. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com> Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com> Reviewed-by: Yanteng Si <siyanteng@loongson.cn> [for the Chinese translation] Reviewed-by: Alex Shi <alexs@kernel.org> Link: https://lore.kernel.org/r/20231201161652.1241695-2-vincent.guittot@linaro.org
2023-12-23cpufreq/schedutil: Use a fixed reference frequencyVincent Guittot
cpuinfo.max_freq can change at runtime because of boost as an example. This implies that the value could be different than the one that has been used when computing the capacity of a CPU. The new arch_scale_freq_ref() returns a fixed and coherent reference frequency that can be used when computing a frequency based on utilization. Use this arch_scale_freq_ref() when available and fallback to policy otherwise. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Link: https://lore.kernel.org/r/20231211104855.558096-4-vincent.guittot@linaro.org
2023-12-23Merge tag 'v6.7-rc6' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-12-21entry: Move syscall_enter_from_user_mode() to header fileSven Schnelle
To allow inlining of syscall_enter_from_user_mode(), move it to entry-common.h. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20231218074520.1998026-4-svens@linux.ibm.com
2023-12-21entry: Move enter_from_user_mode() to header fileSven Schnelle
To allow inlining of enter_from_user_mode(), move it to entry-common.h. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20231218074520.1998026-3-svens@linux.ibm.com
2023-12-21entry: Move exit to usermode functions to header fileSven Schnelle
To allow inlining, move exit_to_user_mode() to entry-common.h. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20231218074520.1998026-2-svens@linux.ibm.com
2023-12-21bpf: Avoid unnecessary use of comma operator in verifierSimon Horman
Although it does not seem to have any untoward side-effects, the use of ';' to separate to assignments seems more appropriate than ','. Flagged by clang-17 -Wcomma No functional change intended. Compile tested only. Signed-off-by: Simon Horman <horms@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/bpf/20231221-bpf-verifier-comma-v1-1-cde2530912e9@kernel.org
2023-12-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR. Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c 23c93c3b6275 ("bnxt_en: do not map packet buffers twice") 6d1add95536b ("bnxt_en: Modify TX ring indexing logic.") tools/testing/selftests/net/Makefile 2258b666482d ("selftests: add vlan hw filter tests") a0bc96c0cd6e ("selftests: net: verify fq per-band packet limit") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-12-21Merge tag 'trace-v6.7-rc6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix another kerneldoc warning - Fix eventfs files to inherit the ownership of its parent directory. The dynamic creation of dentries in eventfs did not take into account if the tracefs file system was mounted with a gid/uid, and would still default to the gid/uid of root. This is a regression. - Fix warning when synthetic event testing is enabled along with startup event tracing testing is enabled * tag 'trace-v6.7-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing / synthetic: Disable events after testing in synth_event_gen_test_init() eventfs: Have event files and directories default to parent uid and gid tracing/synthetic: fix kernel-doc warnings
2023-12-21ring-buffer: Use subbuf_order for buffer page maskingSteven Rostedt (Google)
The comparisons to PAGE_SIZE were all converted to use the buffer->subbuf_order, but the use of PAGE_MASK was missed. Convert all the PAGE_MASK usages over to: (PAGE_SIZE << cpu_buffer->buffer->subbuf_order) - 1 Link: https://lore.kernel.org/linux-trace-kernel/20231219173800.66eefb7a@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Fixes: 139f84002145 ("ring-buffer: Page size per ring buffer") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21tracing: Update subbuffer with kilobytes not page orderSteven Rostedt (Google)
Using page order for deciding what the size of the ring buffer sub buffers are is exposing a bit too much of the implementation. Although the sub buffers are only allocated in orders of pages, allow the user to specify the minimum size of each sub-buffer via kilobytes like they can with the buffer size itself. If the user specifies 3 via: echo 3 > buffer_subbuf_size_kb Then the sub-buffer size will round up to 4kb (on a 4kb page size system). If they specify: echo 6 > buffer_subbuf_size_kb The sub-buffer size will become 8kb. and so on. Link: https://lore.kernel.org/linux-trace-kernel/20231219185631.809766769@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21ring-buffer: Just update the subbuffers when changing their allocation orderSteven Rostedt (Google)
The ring_buffer_subbuf_order_set() was creating ring_buffer_per_cpu cpu_buffers with the new subbuffers with the updated order, and if they all successfully were created, then they the ring_buffer's per_cpu buffers would be freed and replaced by them. The problem is that the freed per_cpu buffers contains state that would be lost. Running the following commands: 1. # echo 3 > /sys/kernel/tracing/buffer_subbuf_order 2. # echo 0 > /sys/kernel/tracing/tracing_cpumask 3. # echo 1 > /sys/kernel/tracing/snapshot 4. # echo ff > /sys/kernel/tracing/tracing_cpumask 5. # echo test > /sys/kernel/tracing/trace_marker Would result in: -bash: echo: write error: Bad file descriptor That's because the state of the per_cpu buffers of the snapshot buffer is lost when the order is changed (the order of a freed snapshot buffer goes to 0 to save memory, and when the snapshot buffer is allocated again, it goes back to what the main buffer is). In operation 2, the snapshot buffers were set to "disable" (as all the ring buffers CPUs were disabled). In operation 3, the snapshot is allocated and a call to ring_buffer_subbuf_order_set() replaced the per_cpu buffers losing the "record_disable" count. When it was enabled again, the atomic_dec(&cpu_buffer->record_disable) was decrementing a zero, setting it to -1. Writing 1 into the snapshot would swap the snapshot buffer with the main buffer, so now the main buffer is "disabled", and nothing can write to the ring buffer anymore. Instead of creating new per_cpu buffers and losing the state of the old buffers, basically do what the resize does and just allocate new subbuf pages into the new_pages link list of the per_cpu buffer and if they all succeed, then replace the old sub buffers with the new ones. This keeps the per_cpu buffer descriptor in tact and by doing so, keeps its state. Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.944104939@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Fixes: f9b94daa542a ("ring-buffer: Set new size of the ring buffer sub page") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21ring-buffer: Keep the same size when updating the orderSteven Rostedt (Google)
The function ring_buffer_subbuf_order_set() just updated the sub-buffers to the new size, but this also changes the size of the buffer in doing so. As the size is determined by nr_pages * subbuf_size. If the subbuf_size is increased without decreasing the nr_pages, this causes the total size of the buffer to increase. This broke the latency tracers as the snapshot needs to be the same size as the main buffer. The size of the snapshot buffer is only expanded when needed, and because the order is still the same, the size becomes out of sync with the main buffer, as the main buffer increased in size without the tracing system knowing. Calculate the nr_pages to allocate with the new subbuf_size to be buffer_size / new_subbuf_size. Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.649397785@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Fixes: f9b94daa542a ("ring-buffer: Set new size of the ring buffer sub page") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21tracing: Stop the tracing while changing the ring buffer subbuf sizeSteven Rostedt (Google)
Because the main buffer and the snapshot buffer need to be the same for some tracers, otherwise it will fail and disable all tracing, the tracers need to be stopped while updating the sub buffer sizes so that the tracers see the main and snapshot buffers with the same sub buffer size. Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.353222794@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Fixes: 2808e31ec12e ("ring-buffer: Add interface for configuring trace sub buffer size") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21tracing: Update snapshot order along with main buffer orderSteven Rostedt (Google)
When updating the order of the sub buffers for the main buffer, make sure that if the snapshot buffer exists, that it gets its order updated as well. Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.054668186@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21ring-buffer: Make sure the spare sub buffer used for reads has same sizeSteven Rostedt (Google)
Now that the ring buffer specifies the size of its sub buffers, they all need to be the same size. When doing a read, a swap is done with a spare page. Make sure they are the same size before doing the swap, otherwise the read will fail. Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.763664788@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21ring-buffer: Do no swap cpu buffers if order is differentSteven Rostedt (Google)
As all the subbuffer order (subbuffer sizes) must be the same throughout the ring buffer, check the order of the buffers that are doing a CPU buffer swap in ring_buffer_swap_cpu() to make sure they are the same. If the are not the same, then fail to do the swap, otherwise the ring buffer will think the CPU buffer has a specific subbuffer size when it does not. Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.467894710@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failureSteven Rostedt (Google)
On failure to allocate ring buffer pages, the pointer to the CPU buffer pages is freed, but the pages that were allocated previously were not. Make sure they are freed too. Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.179352802@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Fixes: f9b94daa542a ("tracing: Set new size of the ring buffer sub page") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21tracing / synthetic: Disable events after testing in synth_event_gen_test_init()Steven Rostedt (Google)
The synth_event_gen_test module can be built in, if someone wants to run the tests at boot up and not have to load them. The synth_event_gen_test_init() function creates and enables the synthetic events and runs its tests. The synth_event_gen_test_exit() disables the events it created and destroys the events. If the module is builtin, the events are never disabled. The issue is, the events should be disable after the tests are run. This could be an issue if the rest of the boot up tests are enabled, as they expect the events to be in a known state before testing. That known state happens to be disabled. When CONFIG_SYNTH_EVENT_GEN_TEST=y and CONFIG_EVENT_TRACE_STARTUP_TEST=y a warning will trigger: Running tests on trace events: Testing event create_synth_test: Enabled event during self test! ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1 at kernel/trace/trace_events.c:4150 event_trace_self_tests+0x1c2/0x480 Modules linked in: CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.7.0-rc2-test-00031-gb803d7c664d5-dirty #276 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 RIP: 0010:event_trace_self_tests+0x1c2/0x480 Code: bb e8 a2 ab 5d fc 48 8d 7b 48 e8 f9 3d 99 fc 48 8b 73 48 40 f6 c6 01 0f 84 d6 fe ff ff 48 c7 c7 20 b6 ad bb e8 7f ab 5d fc 90 <0f> 0b 90 48 89 df e8 d3 3d 99 fc 48 8b 1b 4c 39 f3 0f 85 2c ff ff RSP: 0000:ffffc9000001fdc0 EFLAGS: 00010246 RAX: 0000000000000029 RBX: ffff88810399ca80 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffb9f19478 RDI: ffff88823c734e64 RBP: ffff88810399f300 R08: 0000000000000000 R09: fffffbfff79eb32a R10: ffffffffbcf59957 R11: 0000000000000001 R12: ffff888104068090 R13: ffffffffbc89f0a0 R14: ffffffffbc8a0f08 R15: 0000000000000078 FS: 0000000000000000(0000) GS:ffff88823c700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000001f6282001 CR4: 0000000000170ef0 Call Trace: <TASK> ? __warn+0xa5/0x200 ? event_trace_self_tests+0x1c2/0x480 ? report_bug+0x1f6/0x220 ? handle_bug+0x6f/0x90 ? exc_invalid_op+0x17/0x50 ? asm_exc_invalid_op+0x1a/0x20 ? tracer_preempt_on+0x78/0x1c0 ? event_trace_self_tests+0x1c2/0x480 ? __pfx_event_trace_self_tests_init+0x10/0x10 event_trace_self_tests_init+0x27/0xe0 do_one_initcall+0xd6/0x3c0 ? __pfx_do_one_initcall+0x10/0x10 ? kasan_set_track+0x25/0x30 ? rcu_is_watching+0x38/0x60 kernel_init_freeable+0x324/0x450 ? __pfx_kernel_init+0x10/0x10 kernel_init+0x1f/0x1e0 ? _raw_spin_unlock_irq+0x33/0x50 ret_from_fork+0x34/0x60 ? __pfx_kernel_init+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> This is because the synth_event_gen_test_init() left the synthetic events that it created enabled. By having it disable them after testing, the other selftests will run fine. Link: https://lore.kernel.org/linux-trace-kernel/20231220111525.2f0f49b0@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: 9fe41efaca084 ("tracing: Add synth event generation test module") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reported-by: Alexander Graf <graf@amazon.com> Tested-by: Alexander Graf <graf@amazon.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21bpf: Re-support uid and gid when mounting bpffsDaniel Borkmann
For a clean, conflict-free revert of the token-related patches in commit d17aff807f84 ("Revert BPF token-related functionality"), the bpf fs commit 750e785796bb ("bpf: Support uid and gid when mounting bpffs") was undone temporarily as well. This patch manually re-adds the functionality from the original one back in 750e785796bb, no other functional changes intended. Testing: # mount -t bpf -o uid=65534,gid=65534 bpffs ./foo # ls -la . | grep foo drwxrwxrwt 2 nobody nogroup 0 Dec 20 13:16 foo # mount -t bpf bpffs on /root/foo type bpf (rw,relatime,uid=65534,gid=65534) Also, passing invalid arguments for uid/gid are properly rejected as expected. Fixes: d17aff807f84 ("Revert BPF token-related functionality") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Christian Brauner <brauner@kernel.org> Cc: Jie Jiang <jiejiang@chromium.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/bpf/20231220133805.20953-1-daniel@iogearbox.net
2023-12-21Merge branch 'vfs.file'Christian Brauner
Bring in the changes to the file infrastructure for this cycle. Mostly cleanups and some performance tweaks. * file: remove __receive_fd() * file: stop exposing receive_fd_user() * fs: replace f_rcuhead with f_task_work * file: remove pointless wrapper * file: s/close_fd_get_file()/file_close_fd()/g * Improve __fget_files_rcu() code generation (and thus __fget_light()) * file: massage cleanup of files that failed to open Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-21watch_queue: fix kcalloc() arguments orderDmitry Antipov
When compiling with gcc version 14.0.0 20231220 (experimental) and W=1, I've noticed the following warning: kernel/watch_queue.c: In function 'watch_queue_set_size': kernel/watch_queue.c:273:32: warning: 'kcalloc' sizes specified with 'sizeof' in the earlier argument and not in the later argument [-Wcalloc-transposed-args] 273 | pages = kcalloc(sizeof(struct page *), nr_pages, GFP_KERNEL); | ^~~~~~ Since 'n' and 'size' arguments of 'kcalloc()' are multiplied to calculate the final size, their actual order doesn't affect the result and so this is not a bug. But it's still worth to fix it. Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Link: https://lore.kernel.org/r/20231221090139.12579-1-dmantipov@yandex.ru Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-20posix-timers: Get rid of [COMPAT_]SYS_NI() usesLinus Torvalds
Only the posix timer system calls use this (when the posix timer support is disabled, which does not actually happen in any normal case), because they had debug code to print out a warning about missing system calls. Get rid of that special case, and just use the standard COND_SYSCALL interface that creates weak system call stubs that return -ENOSYS for when the system call does not exist. This fixes a kCFI issue with the SYS_NI() hackery: CFI failure at int80_emulation+0x67/0xb0 (target: sys_ni_posix_timers+0x0/0x70; expected type: 0xb02b34d9) WARNING: CPU: 0 PID: 48 at int80_emulation+0x67/0xb0 Reported-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Tested-by: Sami Tolvanen <samitolvanen@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-12-20wait: Remove uapi header file from main header fileMatthew Wilcox (Oracle)
There's really no overlap between uapi/linux/wait.h and linux/wait.h. There are two files which rely on the uapi file being implcitly included, so explicitly include it there and remove it from the main header file. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Christian Brauner <brauner@kernel.org>
2023-12-20plist: Split out plist_types.hKent Overstreet
Trimming down sched.h dependencies: we don't want to include more than the base types. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-20sched.h: move pid helpers to pid.hKent Overstreet
This is needed for killing the sched.h dependency on rcupdate.h, and pid.h is a better place for this code anyways. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-20kernel/numa.c: Move logging out of numa.hKent Overstreet
Moving these stub functions to a .c file means we can kill a sched.h dependency on printk.h. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-20kernel/fork.c: add missing includeKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-20crash_core: remove duplicated including of kexec.hWang Jinchao
Remove second include of linux/kexec.h Link: https://lkml.kernel.org/r/202312151654+0800-wangjinchao@xfusion.com Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20kexec: use ALIGN macro instead of open-coding itYuntao Wang
Use ALIGN macro instead of open-coding it to improve code readability. Link: https://lkml.kernel.org/r/20231212142706.25149-1-ytcoode@gmail.com Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20fork: remove redundant TASK_UNINTERRUPTIBLEKevin Hao
TASK_KILLABLE already includes TASK_UNINTERRUPTIBLE, so there is no need to add a separate TASK_UNINTERRUPTIBLE. Link: https://lkml.kernel.org/r/20231208084115.1973285-1-haokexin@gmail.com Signed-off-by: Kevin Hao <haokexin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20kexec_file: print out debugging message if requiredBaoquan He
Then when specifying '-d' for kexec_file_load interface, loaded locations of kernel/initrd/cmdline etc can be printed out to help debug. Here replace pr_debug() with the newly added kexec_dprintk() in kexec_file loading related codes. And also print out type/start/head of kimage and flags to help debug. Link: https://lkml.kernel.org/r/20231213055747.61826-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Conor Dooley <conor@kernel.org> Cc: Joe Perches <joe@perches.com> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20kexec_file: add kexec_file flag to control debug printingBaoquan He
Patch series "kexec_file: print out debugging message if required", v4. Currently, specifying '-d' on kexec command will print a lot of debugging informationabout kexec/kdump loading with kexec_load interface. However, kexec_file_load prints nothing even though '-d' is specified. It's very inconvenient to debug or analyze the kexec/kdump loading when something wrong happened with kexec/kdump itself or develper want to check the kexec/kdump loading. In this patchset, a kexec_file flag is KEXEC_FILE_DEBUG added and checked in code. If it's passed in, debugging message of kexec_file code will be printed out and can be seen from console and dmesg. Otherwise, the debugging message is printed like beofre when pr_debug() is taken. Note: **** ===== 1) The code in kexec-tools utility also need be changed to support passing KEXEC_FILE_DEBUG to kernel when 'kexec -s -d' is specified. The patch link is here: ========= [PATCH] kexec_file: add kexec_file flag to support debug printing http://lists.infradead.org/pipermail/kexec/2023-November/028505.html 2) s390 also has kexec_file code, while I am not sure what debugging information is necessary. So leave it to s390 developer. Test: **** ==== Testing was done in v1 on x86_64 and arm64. For v4, tested on x86_64 again. And on x86_64, the printed messages look like below: -------------------------------------------------------------- kexec measurement buffer for the loaded kernel at 0x207fffe000. Loaded purgatory at 0x207fff9000 Loaded boot_param, command line and misc at 0x207fff3000 bufsz=0x1180 memsz=0x1180 Loaded 64bit kernel at 0x207c000000 bufsz=0xc88200 memsz=0x3c4a000 Loaded initrd at 0x2079e79000 bufsz=0x2186280 memsz=0x2186280 Final command line is: root=/dev/mapper/fedora_intel--knightslanding--lb--02-root ro rd.lvm.lv=fedora_intel-knightslanding-lb-02/root console=ttyS0,115200N81 crashkernel=256M E820 memmap: 0000000000000000-000000000009a3ff (1) 000000000009a400-000000000009ffff (2) 00000000000e0000-00000000000fffff (2) 0000000000100000-000000006ff83fff (1) 000000006ff84000-000000007ac50fff (2) ...... 000000207fff6150-000000207fff615f (128) 000000207fff6160-000000207fff714f (1) 000000207fff7150-000000207fff715f (128) 000000207fff7160-000000207fff814f (1) 000000207fff8150-000000207fff815f (128) 000000207fff8160-000000207fffffff (1) nr_segments = 5 segment[0]: buf=0x000000004e5ece74 bufsz=0x211 mem=0x207fffe000 memsz=0x1000 segment[1]: buf=0x000000009e871498 bufsz=0x4000 mem=0x207fff9000 memsz=0x5000 segment[2]: buf=0x00000000d879f1fe bufsz=0x1180 mem=0x207fff3000 memsz=0x2000 segment[3]: buf=0x000000001101cd86 bufsz=0xc88200 mem=0x207c000000 memsz=0x3c4a000 segment[4]: buf=0x00000000c6e38ac7 bufsz=0x2186280 mem=0x2079e79000 memsz=0x2187000 kexec_file_load: type:0, start:0x207fff91a0 head:0x109e004002 flags:0x8 --------------------------------------------------------------------------- This patch (of 7): When specifying 'kexec -c -d', kexec_load interface will print loading information, e.g the regions where kernel/initrd/purgatory/cmdline are put, the memmap passed to 2nd kernel taken as system RAM ranges, and printing all contents of struct kexec_segment, etc. These are very helpful for analyzing or positioning what's happening when kexec/kdump itself failed. The debugging printing for kexec_load interface is made in user space utility kexec-tools. Whereas, with kexec_file_load interface, 'kexec -s -d' print nothing. Because kexec_file code is mostly implemented in kernel space, and the debugging printing functionality is missed. It's not convenient when debugging kexec/kdump loading and jumping with kexec_file_load interface. Now add KEXEC_FILE_DEBUG to kexec_file flag to control the debugging message printing. And add global variable kexec_file_dbg_print and macro kexec_dprintk() to facilitate the printing. This is a preparation, later kexec_dprintk() will be used to replace the existing pr_debug(). Once 'kexec -s -d' is specified, it will print out kexec/kdump loading information. If '-d' is not specified, it regresses to pr_debug(). Link: https://lkml.kernel.org/r/20231213055747.61826-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20231213055747.61826-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Conor Dooley <conor@kernel.org> Cc: Joe Perches <joe@perches.com> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20sync mm-stable with mm-hotfixes-stable to pick up depended-upon changesAndrew Morton
2023-12-20kexec: select CRYPTO from KEXEC_FILE instead of depending on itArnd Bergmann
All other users of crypto code use 'select' instead of 'depends on', so do the same thing with KEXEC_FILE for consistency. In practice this makes very little difference as kernels with kexec support are very likely to also include some other feature that already selects both crypto and crypto_sha256, but being consistent here helps for usability as well as to avoid potential circular dependencies. This reverts the dependency back to what it was originally before commit 74ca317c26a3f ("kexec: create a new config option CONFIG_KEXEC_FILE for new syscall"), which changed changed it with the comment "This should be safer as "select" is not recursive", but that appears to have been done in error, as "select" is indeed recursive, and there are no other dependencies that prevent CRYPTO_SHA256 from being selected here. Link: https://lkml.kernel.org/r/20231023110308.1202042-2-arnd@kernel.org Fixes: 74ca317c26a3f ("kexec: create a new config option CONFIG_KEXEC_FILE for new syscall") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Eric DeVolder <eric_devolder@yahoo.com> Tested-by: Eric DeVolder <eric_devolder@yahoo.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Conor Dooley <conor@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20kexec: fix KEXEC_FILE dependenciesArnd Bergmann
The cleanup for the CONFIG_KEXEC Kconfig logic accidentally changed the 'depends on CRYPTO=y' dependency to a plain 'depends on CRYPTO', which causes a link failure when all the crypto support is in a loadable module and kexec_file support is built-in: x86_64-linux-ld: vmlinux.o: in function `__x64_sys_kexec_file_load': (.text+0x32e30a): undefined reference to `crypto_alloc_shash' x86_64-linux-ld: (.text+0x32e58e): undefined reference to `crypto_shash_update' x86_64-linux-ld: (.text+0x32e6ee): undefined reference to `crypto_shash_final' Both s390 and x86 have this problem, while ppc64 and riscv have the correct dependency already. On riscv, the dependency is only used for the purgatory, not for the kexec_file code itself, which may be a bit surprising as it means that with CONFIG_CRYPTO=m, it is possible to enable KEXEC_FILE but then the purgatory code is silently left out. Move this into the common Kconfig.kexec file in a way that is correct everywhere, using the dependency on CRYPTO_SHA256=y only when the purgatory code is available. This requires reversing the dependency between ARCH_SUPPORTS_KEXEC_PURGATORY and KEXEC_FILE, but the effect remains the same, other than making riscv behave like the other ones. On s390, there is an additional dependency on CRYPTO_SHA256_S390, which should technically not be required but gives better performance. Remove this dependency here, noting that it was not present in the initial Kconfig code but was brought in without an explanation in commit 71406883fd357 ("s390/kexec_file: Add kexec_file_load system call"). [arnd@arndb.de: fix riscv build] Link: https://lkml.kernel.org/r/67ddd260-d424-4229-a815-e3fcfb864a77@app.fastmail.com Link: https://lkml.kernel.org/r/20231023110308.1202042-1-arnd@kernel.org Fixes: 6af5138083005 ("x86/kexec: refactor for kernel/Kconfig.kexec") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Eric DeVolder <eric_devolder@yahoo.com> Tested-by: Eric DeVolder <eric_devolder@yahoo.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Conor Dooley <conor@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20bpf: Use c->unit_size to select target cache during freeHou Tao
At present, bpf memory allocator uses check_obj_size() to ensure that ksize() of allocated pointer is equal with the unit_size of used bpf_mem_cache. Its purpose is to prevent bpf_mem_free() from selecting a bpf_mem_cache which has different unit_size compared with the bpf_mem_cache used for allocation. But as reported by lkp, the return value of ksize() or kmalloc_size_roundup() may change due to slab merge and it will lead to the warning report in check_obj_size(). The reported warning happened as follows: (1) in bpf_mem_cache_adjust_size(), kmalloc_size_roundup(96) returns the object_size of kmalloc-96 instead of kmalloc-cg-96. The object_size of kmalloc-96 is 96, so size_index for 96 is not adjusted accordingly. (2) the object_size of kmalloc-cg-96 is adjust from 96 to 128 due to slab merge in __kmem_cache_alias(). For SLAB, SLAB_HWCACHE_ALIGN is enabled by default for kmalloc slab, so align is 64 and size is 128 for kmalloc-cg-96. SLUB has a similar merge logic, but its object_size will not be changed, because its align is 8 under x86-64. (3) when unit_alloc() does kmalloc_node(96, __GFP_ACCOUNT, node), ksize() returns 128 instead of 96 for the returned pointer. (4) the warning in check_obj_size() is triggered. Considering the slab merge can happen in anytime (e.g, a slab created in a new module), the following case is also possible: during the initialization of bpf_global_ma, there is no slab merge and ksize() for a 96-bytes object returns 96. But after that a new slab created by a kernel module is merged to kmalloc-cg-96 and the object_size of kmalloc-cg-96 is adjust from 96 to 128 (which is possible for x86-64 + CONFIG_SLAB, because its alignment requirement is 64 for 96-bytes slab). So soon or later, when bpf_global_ma frees a 96-byte-sized pointer which is allocated from bpf_mem_cache with unit_size=96, bpf_mem_free() will free the pointer through a bpf_mem_cache in which unit_size is 128, because the return value of ksize() changes. The warning for the mismatch will be triggered again. A feasible fix is introducing similar APIs compared with ksize() and kmalloc_size_roundup() to return the actually-allocated size instead of size which may change due to slab merge, but it will introduce unnecessary dependency on the implementation details of mm subsystem. As for now the pointer of bpf_mem_cache is saved in the 8-bytes area (or 4-bytes under 32-bit host) above the returned pointer, using unit_size in the saved bpf_mem_cache to select the target cache instead of inferring the size from the pointer itself. Beside no extra dependency on mm subsystem, the performance for bpf_mem_free_rcu() is also improved as shown below. Before applying the patch, the performances of bpf_mem_alloc() and bpf_mem_free_rcu() on 8-CPUs VM with one producer are as follows: kmalloc : alloc 11.69 ± 0.28M/s free 29.58 ± 0.93M/s percpu : alloc 14.11 ± 0.52M/s free 14.29 ± 0.99M/s After apply the patch, the performance for bpf_mem_free_rcu() increases 9% and 146% for kmalloc memory and per-cpu memory respectively: kmalloc: alloc 11.01 ± 0.03M/s free 32.42 ± 0.48M/s percpu: alloc 12.84 ± 0.12M/s free 35.24 ± 0.23M/s After the fixes, there is no need to adjust size_index to fix the mismatch between allocation and free, so remove it as well. Also return NULL instead of ZERO_SIZE_PTR for zero-sized alloc in bpf_mem_alloc(), because there is no bpf_mem_cache pointer saved above ZERO_SIZE_PTR. Fixes: 9077fc228f09 ("bpf: Use kmalloc_size_roundup() to adjust size_index") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/bpf/202310302113.9f8fe705-oliver.sang@intel.com Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231216131052.27621-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-20PM: hibernate: Repair excess function parameter description warningRandy Dunlap
Function swsusp_close() does not have any parameters, so remove the description of parameter @exclusive to prevent this warning. swap.c:1573: warning: Excess function parameter 'exclusive' description in 'swsusp_close' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> [ rjw: Subject edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-20PM: sleep: Remove obsolete comment from unlock_system_sleep()Kevin Hao
With the freezer changes introduced by commit f5d39b020809 ("freezer,sched: Rewrite core freezer logic"), the comment in unlock_system_sleep() has become obsolete, there is no need to retain it. Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-20tracing/synthetic: fix kernel-doc warningsRandy Dunlap
scripts/kernel-doc warns about using @args: for variadic arguments to functions. Documentation/doc-guide/kernel-doc.rst says that this should be written as @...: instead, so update the source code to match that, preventing the warnings. trace_events_synth.c:1165: warning: Excess function parameter 'args' description in '__synth_event_gen_cmd_start' trace_events_synth.c:1714: warning: Excess function parameter 'args' description in 'synth_event_trace' Link: https://lore.kernel.org/linux-trace-kernel/20231220061226.30962-1-rdunlap@infradead.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 35ca5207c2d11 ("tracing: Add synthetic event command generation functions") Fixes: 8dcc53ad956d2 ("tracing: Add synth_event_trace() and related functions") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-20timers: Fix nextevt calculation when no timers are pendingAnna-Maria Behnsen
When no timer is queued into an empty timer base, the next_expiry will not be updated. It was originally calculated as base->clk + NEXT_TIMER_MAX_DELTA When the timer base stays empty long enough (> NEXT_TIMER_MAX_DELTA), the next_expiry value of the empty base suggests that there is a timer pending soon. This might be more a kind of a theoretical problem, but the fix doesn't hurt. Use only base->next_expiry value as nextevt when timers are pending. Otherwise nextevt will be jiffies + NEXT_TIMER_MAX_DELTA. As all information is in place, update base->next_expiry value of the empty timer base as well. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20231201092654.34614-13-anna-maria@linutronix.de
2023-12-20timers: Rework idle logicThomas Gleixner
To improve readability of the code, split base->idle calculation and expires calculation into separate parts. While at it, update the comment about timer base idle marking. Thereby the following subtle change happens if the next event is just one jiffy ahead and the tick was already stopped: Originally base->is_idle remains true in this situation. Now base->is_idle turns to false. This may spare an IPI if a timer is enqueued remotely to an idle CPU that is going to tick on the next jiffy. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20231201092654.34614-12-anna-maria@linutronix.de