path: root/include/linux
Age  Commit message  Author
2025-03-17include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.hJuri Lelli
dl_rebuild_rd_accounting() is defined in cpuset.c, so it makes more sense to move related declarations to cpuset.h. Implement the move. Suggested-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Waiman Long <llong@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MSOVMpU7jpVrMU@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/topology: Stop exposing partition_sched_domains_lockedJuri Lelli
There are no callers of partition_sched_domains_locked() outside topology.c. Stop exposing the function. Suggested-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MSC96a8FcqWV3G@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/deadline: Rebuild root domain accounting after every updateJuri Lelli
Rebuilding of root domains accounting information (total_bw) is currently broken in some cases, e.g. suspend/resume on aarch64. The problem is that the way we keep track of domain changes and try to add bandwidth back is convoluted and fragile. Fix it by simplifying things: make sure bandwidth accounting is cleared and completely restored after root domain changes (once root domains are stable again). To be sure we always call dl_rebuild_rd_accounting() while holding cpuset_mutex, also add a cpuset_reset_sched_domains() wrapper. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Co-developed-by: Waiman Long <llong@redhat.com> Signed-off-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MRfeJKJUOyUSto@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/deadline: Generalize unique visiting of root domainsJuri Lelli
Bandwidth checks and updates that work on root domains currently employ a cookie mechanism for efficiency. This mechanism is very much tied to when root domains are first created and initialized. Generalize the cookie mechanism so that it can be used also later at runtime while updating root domains. Also, additionally guard it with sched_domains_mutex, since domains need to be stable while updating them (and it will be required for further dynamic changes). Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MQaiXPvEeW_v7x@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/topology: Wrappers for sched_domains_mutexJuri Lelli
Create wrappers for sched_domains_mutex so that it can transparently be used on both CONFIG_SMP and !CONFIG_SMP, as some functions will need to do. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MP5Oq9RB8jBs3y@jlelli-thinkpadt14gen4.remote.csb
2025-03-17perf: Clean up pmu specific dataKan Liang
The pmu specific data is saved in task_struct now. Remove it from event context structure. Remove swap_task_ctx() as well. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250314172700.438923-7-kan.liang@linux.intel.com
2025-03-17sched: Add a generic function to return the preemption stringSebastian Andrzej Siewior
The individual architectures often add the preemption model to the beginning of the backtrace. This is the case on X86 or ARM64 for the "die" case but not for regular warnings. With the addition of DYNAMIC_PREEMPT for PREEMPT_RT we end up with CONFIG_PREEMPT and CONFIG_PREEMPT_RT set simultaneously. That means that everyone who tried to add that piece of information gets it wrong for PREEMPT_RT because PREEMPT is checked first. Provide a generic function which returns the current scheduling model considering LAZY preempt and the current state of PREEMPT_DYNAMIC. The resulting strings are:

┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Model     ┃ -RT -DYN     ┃ +RT -DYN          ┃ -RT +DYN           ┃ +RT +DYN          ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│NONE       │ NONE         │ n/a               │ PREEMPT(none)      │ n/a               │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│VOLUNTARY  │ VOLUNTARY    │ n/a               │ PREEMPT(voluntary) │ n/a               │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│FULL       │ PREEMPT      │ PREEMPT_RT        │ PREEMPT(full)      │ PREEMPT_{RT,full} │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│LAZY       │ PREEMPT_LAZY │ PREEMPT_{RT,LAZY} │ PREEMPT(lazy)      │ PREEMPT_{RT,lazy} │
└───────────┴──────────────┴───────────────────┴────────────────────┴───────────────────┘

[ The dynamic building of the string can lead to an empty string if the function is invoked simultaneously on two CPUs. ]

Co-developed-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Co-developed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Signed-off-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20250314160810.2373416-2-bigeasy@linutronix.de
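The string table in this commit message can be modelled in user space as follows. This is only a sketch: `toy_preempt_model_str()`, the `toy_model` enum and the boolean `rt`/`dyn` parameters are hypothetical names for illustration, not the kernel's helper or its config plumbing. The caller supplies the buffer, which sidesteps the shared-buffer race mentioned in the bracketed note.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model enum; the kernel derives this from its config. */
enum toy_model { TOY_NONE, TOY_VOLUNTARY, TOY_FULL, TOY_LAZY };

/*
 * Fill buf with the string from the table for (model, rt, dyn).
 * Returns 0 on success and -1 for the combinations the table lists
 * as "n/a" (PREEMPT_RT together with NONE or VOLUNTARY).
 */
static int toy_preempt_model_str(enum toy_model m, bool rt, bool dyn,
                                 char *buf, size_t len)
{
    static const char *const lower[] = { "none", "voluntary", "full", "lazy" };
    static const char *const fixed[] = { "NONE", "VOLUNTARY", "PREEMPT",
                                         "PREEMPT_LAZY" };

    if (rt && (m == TOY_NONE || m == TOY_VOLUNTARY))
        return -1;

    if (dyn && rt)
        snprintf(buf, len, "PREEMPT_{RT,%s}", lower[m]);   /* +RT +DYN */
    else if (dyn)
        snprintf(buf, len, "PREEMPT(%s)", lower[m]);       /* -RT +DYN */
    else if (rt)
        snprintf(buf, len, "%s",                           /* +RT -DYN */
                 m == TOY_FULL ? "PREEMPT_RT" : "PREEMPT_{RT,LAZY}");
    else
        snprintf(buf, len, "%s", fixed[m]);                /* -RT -DYN */
    return 0;
}
```

Checking RT before consulting the plain PREEMPT string is the point of the commit: with both options set, testing PREEMPT first would report the wrong model.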
2025-03-17perf: Supply task information to sched_task()Kan Liang
To save/restore LBR call stack data in system-wide mode, the task_struct information is required. Extend the parameters of sched_task() to supply task_struct information. When schedule in, the LBR call stack data for new task will be restored. When schedule out, the LBR call stack data for old task will be saved. Only need to pass the required task_struct information. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250314172700.438923-4-kan.liang@linux.intel.com
2025-03-17perf: attach/detach PMU specific dataKan Liang
The LBR call stack data has to be saved/restored during context switch to fix the shorter LBR call stacks issue in the system-wide mode. Allocate PMU specific data and attach them to the corresponding task_struct during LBR call stack monitoring. When an LBR call stack event is accounted, the perf_ctx_data for the related tasks will be allocated/attached by attach_perf_ctx_data(). When an LBR call stack event is unaccounted, the perf_ctx_data for related tasks will be detached/freed by detach_perf_ctx_data(). The LBR call stack event could be a per-task event or a system-wide event.
- For a per-task event, perf only allocates the perf_ctx_data for the current task. If the allocation fails, perf will error out.
- For a system-wide event, perf has to allocate the perf_ctx_data for both the existing tasks and the upcoming tasks. The allocation for the existing tasks is done in perf_event_alloc(). If any allocation fails, perf will error out. The allocation for the new tasks will be done in perf_event_fork(). A global reader/writer semaphore, global_ctx_data_rwsem, is added to address the global race.
- The perf_ctx_data is only freed by the last LBR call stack event. The number of the per-task events is tracked by the refcount of each task. Since the system-wide events impact all tasks, it's not practical to go through the whole task list to update the refcount for each system-wide event. The number of system-wide events is tracked by a global variable global_ctx_data_ref.
Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250314172700.438923-3-kan.liang@linux.intel.com
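The per-task side of the attach/detach scheme described above can be sketched in user-space C. All names here (`toy_task`, `toy_ctx_data`, `toy_attach()`, `toy_detach()`) are hypothetical stand-ins, and the sketch assumes a single thread with no locking; it omits the system-wide case, global_ctx_data_ref and global_ctx_data_rwsem entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-task PMU data and its user count. */
struct toy_ctx_data {
    int refcount;   /* number of per-task events using the data */
    void *data;     /* PMU specific payload, e.g. LBR call stack state */
};

struct toy_task {
    struct toy_ctx_data *ctx;
};

/* First event allocates the data; later events only take a reference. */
static int toy_attach(struct toy_task *t, size_t size)
{
    struct toy_ctx_data *cd;

    if (t->ctx) {
        t->ctx->refcount++;
        return 0;
    }
    cd = calloc(1, sizeof(*cd));
    if (!cd)
        return -1;
    cd->data = calloc(1, size);
    if (!cd->data) {
        free(cd);
        return -1;              /* allocation failure: caller errors out */
    }
    cd->refcount = 1;
    t->ctx = cd;
    return 0;
}

/* Only the last event to detach frees the data. */
static void toy_detach(struct toy_task *t)
{
    struct toy_ctx_data *cd = t->ctx;

    if (!cd || --cd->refcount > 0)
        return;
    free(cd->data);
    free(cd);
    t->ctx = NULL;
}
```

Two events on the same task share one allocation; the data disappears only when the second one detaches.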
2025-03-17locking/percpu-rwsem: Add guard supportPeter Zijlstra (Intel)
To simplify the usage of the percpu rw semaphore. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250314172700.438923-2-kan.liang@linux.intel.com
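Scope-based guards of this kind rely on the compiler's `cleanup` variable attribute (a GCC/Clang extension). The following user-space sketch shows the idea with a toy reader count instead of a real percpu rw semaphore; `toy_rwsem`, `TOY_READ_GUARD()` and the helper functions are hypothetical names, not the kernel's guard API.

```c
#include <assert.h>

/* Toy lock that merely counts readers; not a real synchronization primitive. */
struct toy_rwsem {
    int readers;
};

static void toy_read_lock(struct toy_rwsem *s)   { s->readers++; }
static void toy_read_unlock(struct toy_rwsem *s) { s->readers--; }

/* Runs automatically when the guard variable goes out of scope. */
static void toy_read_guard_exit(struct toy_rwsem **s)
{
    toy_read_unlock(*s);
}

/* Take the read side now and release it on every exit path of the scope. */
#define TOY_READ_GUARD(sem)                                              \
    struct toy_rwsem *toy_guard_                                         \
        __attribute__((cleanup(toy_read_guard_exit))) =                  \
        (toy_read_lock(sem), (sem))

static int toy_read_value(struct toy_rwsem *sem, const int *value,
                          int *held_inside)
{
    TOY_READ_GUARD(sem);

    *held_inside = sem->readers;    /* observe the lock while it is held */
    if (*value < 0)
        return -1;                  /* early return still drops the lock */
    return *value;
}
```

The benefit is that error paths no longer need explicit unlock calls, which is what "simplify the usage" refers to.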
2025-03-17perf: Save PMU specific data in task_structKan Liang
Some PMU specific data has to be saved/restored during context switch, e.g. LBR call stack data. Currently, the data is saved in event context structure, but only for per-process event. For system-wide event, because of missing the LBR call stack data after context switch, LBR callstacks are always shorter in comparison to per-process mode. For example, Per-process mode: $perf record --call-graph lbr -- taskset -c 0 ./tchain_edit - 99.90% 99.86% tchain_edit tchain_edit [.] f3 99.86% _start __libc_start_main generic_start_main main f1 - f2 f3 System-wide mode: $perf record --call-graph lbr -a -- taskset -c 0 ./tchain_edit - 99.88% 99.82% tchain_edit tchain_edit [.] f3 - 62.02% main f1 f2 f3 - 28.83% f1 - f2 f3 - 28.83% f1 - f2 f3 - 8.88% generic_start_main main f1 f2 f3 It isn't practical to simply allocate the data for system-wide event in CPU context structure for all tasks. We have no idea which CPU a task will be scheduled to. The duplicated LBR data has to be maintained on every CPU context structure. That's a huge waste. Otherwise, the LBR data still lost if the task is scheduled to another CPU. Save the pmu specific data in task_struct. The size of pmu specific data is 788 bytes for LBR call stack. Usually, the overall amount of threads doesn't exceed a few thousands. For 10K threads, keeping LBR data would consume additional ~8MB. The additional space will only be allocated during LBR call stack monitoring. It will be released when the monitoring is finished. Furthermore, moving task_ctx_data from perf_event_context to task_struct can reduce complexity and make things clearer. E.g. perf doesn't need to swap task_ctx_data on optimized context switch path. This patch set is just the first step. There could be other optimization/extension on top of this patch set. E.g. for cgroup profiling, perf just needs to save/store the LBR call stack information for tasks in specific cgroup. That could reduce the additional space. 
Also, the LBR call stack can be available for software events, or allow even debugging use cases, like LBRs on crash later. Because of the alignment requirement of Intel Arch LBR, the Kmem cache is used to allocate the PMU specific data. It's required when child task allocates the space. Save it in struct perf_ctx_data. The refcount in struct perf_ctx_data is used to track the users of pmu specific data. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexey Budankov <alexey.budankov@linux.intel.com> Link: https://lore.kernel.org/r/20250314172700.438923-1-kan.liang@linux.intel.com
2025-03-17perf: Extend per event callchain limit to branch stackKan Liang
The commit 97c79a38cd45 ("perf core: Per event callchain limit") introduced a per-event term to allow finer tuning of the depth of callchains to save space. It should be applied to the branch stack as well. For example, autoFDO collections require maximum LBR entries. In the meantime, other system-wide LBR users may only be interested in the latest few LBRs. A per-event LBR depth would save space in the perf output buffer. The patch simply drops the branches that are of no interest, but HW still collects the maximum branches. There may be a model-specific optimization that can reduce the HW depth for some cases to reduce the overhead further, but it isn't included in the patch set because it's not useful for all cases. For example, ARCH LBR can utilize the PEBS and XSAVE to collect LBRs. The depth should have less impact on the collecting overhead. The model-specific optimization may be implemented later separately. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310181536.3645382-1-kan.liang@linux.intel.com
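Dropping branches in software while the hardware keeps collecting the full depth amounts to truncating the captured array before it is written out. A minimal sketch, assuming that entry 0 is the most recent branch and that a limit of 0 means "no limit"; `toy_branch` and `toy_emit_branches()` are hypothetical names, not perf's data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified branch record. */
struct toy_branch {
    uint64_t from;
    uint64_t to;
};

/*
 * Copy at most max_stack of the nr captured branches into out and
 * return how many were copied. The hardware-side capture (hw, nr) is
 * untouched; only the emitted sample shrinks.
 */
static size_t toy_emit_branches(const struct toy_branch *hw, size_t nr,
                                size_t max_stack, struct toy_branch *out)
{
    size_t n = (max_stack != 0 && max_stack < nr) ? max_stack : nr;

    memcpy(out, hw, n * sizeof(*hw));
    return n;
}
```

With a limit of 2 on a 4-entry capture, only the two newest entries reach the output buffer.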
2025-03-17Merge tag 'v6.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into gpio/for-nextBartosz Golaszewski
Linux 6.14-rc7
2025-03-17mm: zpool: remove zpool_malloc_support_movable()Yosry Ahmed
zpool_malloc_support_movable() always returns true for zsmalloc, the only remaining zpool driver. Remove it and set the gfp flags in zswap_compress() accordingly. Opportunistically use GFP_NOWAIT instead of __GFP_NOWARN | __GFP_KSWAPD_RECLAIM for conciseness as they are equivalent. Link: https://lkml.kernel.org/r/20250305061134.4105762-6-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: zsmalloc: remove object mapping APIs and per-CPU map areasYosry Ahmed
zs_map_object() and zs_unmap_object() are no longer used, remove them. Since these are the only users of per-CPU mapping_areas, remove them and the associated CPU hotplug callbacks too. [yosry.ahmed@linux.dev: update the docs] Link: https://lkml.kernel.org/r/Z8ier-ZZp8T6MOTH@google.com Link: https://lkml.kernel.org/r/20250305061134.4105762-5-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: zpool: remove object mapping APIsYosry Ahmed
zpool_map_handle(), zpool_unmap_handle(), and zpool_can_sleep_mapped() are no longer used. Remove them with the underlying driver callbacks. Link: https://lkml.kernel.org/r/20250305061134.4105762-4-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: zpool: add interfaces for object read/write APIsYosry Ahmed
Patch series "Switch zswap to object read/write APIs". This patch series updates zswap to use the new object read/write APIs defined by zsmalloc in [1], and remove the old object mapping APIs and the related code from zpool and zsmalloc. This patch (of 5): Zsmalloc introduced new APIs to read/write objects besides mapping them. Add the necessary zpool interfaces. Link: https://lkml.kernel.org/r/20250305061134.4105762-1-yosry.ahmed@linux.dev Link: https://lkml.kernel.org/r/20250305061134.4105762-2-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon: add default allow/reject behavior fields to struct damosSeongJae Park
The current default allow/reject behavior of the filters handling stage was decided before the introduction of the allow behavior. For allow-filters usage, it is confusing and inefficient. It is more intuitive to set the default allow/reject behavior of a filtering stage to the opposite of the last filter's behavior. The decision should be made separately for the core and operations layers' filtering stages, since the last core layer-handled filter is not really the last filter if there are operations layer-handled filters. Keeping separate decisions for the two categories can make the logic simpler. Add fields for storing the two decisions. Link: https://lkml.kernel.org/r/20250304211913.53574-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/core: introduce damos->ops_filtersSeongJae Park
Patch series "mm/damon: make allow filters after reject filters useful and intuitive". DAMOS filters allow or reject elements of memory for a given DAMOS scheme only if those match the filter criteria. For elements that don't match any DAMOS filter, 'allowing' is the default behavior. This makes allow-filters that don't have any reject-filter after them meaningless sources of overhead. The decision was made to keep the behavior consistent with that before the introduction of allow-filters. This, however, makes usage of DAMOS filters confusing and inefficient. It is more intuitive and still consistent behavior to reject by default unless there is no filter at all or the last filter is a reject filter. Update the filtering logic in that way and update documents to clarify the behavior. Note that this is changing the old behavior. But the old behavior for the problematic filter combination was definitely confusing, inefficient and anyway useless. Also, the behavior was introduced relatively recently. It is difficult to anticipate any user that depends on the behavior. Hence this is not a user-breaking behavior change but an obvious improvement. This patch (of 9): DAMOS filters can be categorized into two groups depending on which layer handles them, namely the core layer and the ops layer. The groups are important because the filtering behavior depends on the evaluation sequence of filters, and core layer-handled filters are evaluated before operations layer-handled ones. The behavior is clearly documented, but the implementation is a bit inefficient and complicated. All filters are maintained mixed together in a single list (damos->filters). The filter evaluation logic in the core layer and in the operations layer each iterates over all the filters on the list, skipping filters that should not be handled by that layer. It is inefficient. Making future extensions that differentiate filters by handling layer will also be complicated. 
Add a new list that will be used for having all operations layer-handled DAMOS filters to DAMOS scheme data structure. Also add the support of its initialization and basic traversal functions. Link: https://lkml.kernel.org/r/20250304211913.53574-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250304211913.53574-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
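The default-behavior rule stated in the cover letter (reject by default unless there is no filter at all or the last filter is a reject filter) can be sketched as below. The sketch makes two assumptions that are not spelled out in the text: filters are evaluated in order and the first matching filter decides, and a filter matches a simple address range. `toy_filter` and `toy_allowed()` are hypothetical names, and the core/ops layer split is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical filter: matches addresses in [start, end). */
struct toy_filter {
    unsigned long start;
    unsigned long end;
    bool allow;             /* true: allow-filter, false: reject-filter */
};

static bool toy_allowed(const struct toy_filter *f, size_t nr,
                        unsigned long addr)
{
    size_t i;

    /* Assumed evaluation order: first matching filter decides. */
    for (i = 0; i < nr; i++) {
        if (addr >= f[i].start && addr < f[i].end)
            return f[i].allow;
    }

    /* No filter matched: fall back to the stage default. */
    if (nr == 0)
        return true;            /* no filter at all: allow */
    return !f[nr - 1].allow;    /* opposite of the last filter */
}
```

Under this rule a trailing allow-filter becomes meaningful, because everything it does not match is rejected instead of being allowed anyway.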
2025-03-17writeback: fix calculations in trace_balance_dirty_pages() for cgwbTang Yizhou
In the commit dcc25ae76eb7 ("writeback: move global_dirty_limit into wb_domain") of the cgroup writeback backpressure propagation patchset, Tejun made some adaptations to trace_balance_dirty_pages() for cgroup writeback. However, this adaptation was incomplete and Tejun missed further adaptation in the subsequent patches. In the cgroup writeback scenario, if sdtc in balance_dirty_pages() is assigned to mdtc, then upon entering trace_balance_dirty_pages(), __entry->limit should be assigned based on the dirty_limit of the corresponding memcg's wb_domain, rather than global_wb_domain. To address this issue and simplify the implementation, introduce a 'limit' field in struct dirty_throttle_control to store the hard_limit value computed in wb_position_ratio() by calling hard_dirty_limit(). This field will then be used in trace_balance_dirty_pages() to assign the value to __entry->limit. Link: https://lkml.kernel.org/r/20250304110318.159567-4-yizhou.tang@shopee.com Fixes: dcc25ae76eb7 ("writeback: move global_dirty_limit into wb_domain") Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17writeback: let trace_balance_dirty_pages() take struct dtc as parameterTang Yizhou
Patch series "Fix calculations in trace_balance_dirty_pages() for cgwb", v2. In my experiment, I found that the output of trace_balance_dirty_pages() in the cgroup writeback scenario was strange because trace_balance_dirty_pages() always uses global_wb_domain.dirty_limit for related calculations instead of the dirty_limit of the corresponding memcg's wb_domain. The basic idea of the fix is to store the hard dirty limit value computed in wb_position_ratio() into struct dirty_throttle_control and use it for calculations in trace_balance_dirty_pages(). This patch (of 3): Currently, trace_balance_dirty_pages() already has 12 parameters. In patch #3, I initially attempted to introduce an additional parameter. However, in include/linux/trace_events.h, bpf_trace_run12() only supports up to 12 parameters and bpf_trace_run13() does not exist. To reduce the number of parameters in trace_balance_dirty_pages(), we can make it accept a pointer to struct dirty_throttle_control as a parameter. To achieve this, we need to move the definition of struct dirty_throttle_control from mm/page-writeback.c to include/linux/writeback.h. Link: https://lkml.kernel.org/r/20250304110318.159567-1-yizhou.tang@shopee.com Link: https://lkml.kernel.org/r/20250304110318.159567-2-yizhou.tang@shopee.com Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kara <jack@suse.cz> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Tang Yizhou <yizhou.tang@shopee.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17page_counter: reduce struct page_counter sizeShakeel Butt
The struct page_counter has explicit padding for better cache alignment. The commit c6f53ed8f213a ("mm, memcg: cg2 memory{.swap,}.peak write handlers") added a field to the struct page_counter and accidentally increased its size. Let's move the failcnt field, which is a v1-only field, to the same cacheline as usage to reduce the size of struct page_counter. Link: https://lkml.kernel.org/r/20250228075808.207484-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17page_counter: track failcnt only for legacy cgroupsShakeel Butt
Currently page_counter tracks failcnt for counters used by v1 and v2 controllers. However failcnt is only exported for v1 deployment and thus there is no need to maintain it in v2. The oom report does expose failcnt for memory and swap in v2 but v2 already maintains MEMCG_MAX and MEMCG_SWAP_MAX event counters which can be used. Link: https://lkml.kernel.org/r/20250228075808.207484-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: fix lazy mmu docs and usageRyan Roberts
Patch series "Fix lazy mmu mode", v2. I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As part of that, I will extend lazy mmu mode to cover kernel mappings in vmalloc table walkers. While lazy mmu mode is already used for kernel mappings in a few places, this will extend its use significantly. Having reviewed the existing lazy mmu implementations in powerpc, sparc and x86, it looks like there are a bunch of bugs, some of which may be more likely to trigger once I extend the use of lazy mmu. So this series attempts to clarify the requirements and fix all the bugs in advance of that series. See patch #1 commit log for all the details. This patch (of 5): The docs, implementations and use of arch_[enter|leave]_lazy_mmu_mode() are a bit of a mess (to put it politely). There are a number of issues related to nesting of lazy mmu regions and confusion over whether the task, when in a lazy mmu region, is preemptible or not. Fix all the issues relating to the core-mm. Follow up commits will fix the arch-specific implementations. Three arches implement lazy mmu: powerpc, sparc and x86. When arch_[enter|leave]_lazy_mmu_mode() was first introduced by commit 6606c3e0da53 ("[PATCH] paravirt: lazy mmu mode hooks.patch"), it was expected that lazy mmu regions would never nest and that the appropriate page table lock(s) would be held while in the region, thus ensuring the region is non-preemptible. Additionally lazy mmu regions were only used during manipulation of user mappings. Commit 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy updates") started invoking the lazy mmu mode in apply_to_pte_range(), which is used for both user and kernel mappings. For kernel mappings the region is no longer protected by any lock so there is no longer any guarantee about non-preemptibility. Additionally, for RT configs, holding the PTL only implies no CPU migration, it doesn't prevent preemption. 
Commit bcc6cc832573 ("mm: add default definition of set_ptes()") added arch_[enter|leave]_lazy_mmu_mode() to the default implementation of set_ptes(), used by x86. So after this commit, lazy mmu regions can be nested. Additionally commit 1a10a44dfc1d ("sparc64: implement the new page table range API") and commit 9fee28baa601 ("powerpc: implement the new page table range API") did the same for the sparc and powerpc set_ptes() overrides. powerpc couldn't deal with preemption so avoids it in commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash lazy mmu mode"), which explicitly disables preemption for the whole region in its implementation. x86 can support preemption (or at least it could until it tried to add support for nesting; more on this below). Sparc looks to be totally broken in the face of preemption, as far as I can tell. powerpc can't deal with nesting, so avoids it in commit 47b8def9358c ("powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes"), which removes the lazy mmu calls from its implementation of set_ptes(). x86 attempted to support nesting in commit 49147beb0ccb ("x86/xen: allow nesting of same lazy mode") but as far as I can tell, this breaks its support for preemption. In short, it's all a mess; the semantics for arch_[enter|leave]_lazy_mmu_mode() are not clearly defined and as a result the implementations all have different expectations, sticking plasters and bugs. arm64 is aiming to start using these hooks, so let's clean everything up before adding an arm64 implementation. Update the documentation to state that lazy mmu regions can never be nested, must not be called in interrupt context and preemption may or may not be enabled for the duration of the region. And fix the generic implementation of set_ptes() to avoid nesting. Arch-specific fixes to conform to the new spec will follow this one. These issues were spotted by code review and I have no evidence of issues being reported in the wild. 
Link: https://lkml.kernel.org/r/20250303141542.3371656-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20250303141542.3371656-2-ryan.roberts@arm.com Fixes: bcc6cc832573 ("mm: add default definition of set_ptes()") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/core: implement intervals auto-tuningSeongJae Park
Implement the DAMON sampling and aggregation intervals auto-tuning mechanism as briefly described on 'struct damon_intervals_goal'. The core part for deciding the direction and amount of the changes is implemented by reusing the feedback loop function which is used for DAMOS quotas auto-tuning. Unlike the DAMOS quotas auto-tuning use case, though, limit the maximum decrease per adjustment to 50% of the current value. This is because rapid reductions of the intervals have little merit and could unnecessarily increase the monitoring overhead. Link: https://lkml.kernel.org/r/20250303221726.484227-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
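The 50% limit on decreases can be illustrated with a small clamp. This is a sketch of the stated rule only: `toy_limit_decrease()` is a hypothetical helper, and how the proposed value is produced by the feedback loop is outside its scope.

```c
#include <assert.h>

/*
 * Given the current interval and the value proposed by the feedback
 * loop, never let one adjustment shrink the interval below half of
 * its current value. Increases pass through unchanged.
 */
static unsigned long toy_limit_decrease(unsigned long cur,
                                        unsigned long proposed)
{
    unsigned long floor = cur / 2;

    return proposed < floor ? floor : proposed;
}
```

For example, if the current interval is 1000 and the feedback loop proposes 100, the adjusted interval becomes 500; a proposal of 800 or 3000 is applied as is.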
2025-03-17mm/damon: add data structure for monitoring intervals auto-tuningSeongJae Park
Patch series "mm/damon: auto-tune aggregation interval". DAMON requires time-consuming and repetitive aggregation interval tuning. Introduce a feature for automating it using a feedback loop that aims at an amount of observed access events, like auto-exposing cameras.

Background: Access Frequency Monitoring and Aggregation Interval
================================================================

DAMON checks if each memory element (damon_region) is accessed or not for every user-specified time interval called 'sampling interval'. It aggregates the check intervals on a per-element counter called 'nr_accesses'. DAMON users can read the counters to get the access temperature of a given element. The counters are reset at every other user-specified time interval called 'aggregation interval'. This can be illustrated as DAMON continuously capturing a snapshot of access events that happened and were captured within the last aggregation interval. This implies the aggregation interval plays a key role for the quality of the snapshots, like the camera exposure time. If it is too short, the amount of access events that happened and were captured for each snapshot is small, so each snapshot will show not many interesting things but just a cold and dark world with hopefully one pale blue dot or two. If it is too long, too many events are aggregated in a single shot, so each snapshot will look like a world of flames, or Muspellheim. It will be difficult to find practical insights in both cases.

Problem: Time Consuming and Repetitive Tuning
=============================================

The appropriate length of the aggregation interval depends on how frequently the system and workloads are making access events that DAMON can observe. Hence, users have to tune the interval with an excessive amount of tests with the target system and workloads. If the system and workloads are changed, the tuning should be done again. If the characteristic of the workloads is dynamic, it becomes more challenging. 
It is therefore time-consuming and repetitive. The tuning challenge mainly stems from the wrong question. It is not asking users what quality of monitoring results they want, but how DAMON should operate for their hidden goal. To give the right answer, users need to fully understand DAMON's mechanisms and the characteristics of their workloads. Users shouldn't be asked to understand the underlying mechanism. Understanding the characteristics of the workloads shouldn't be the role of users but DAMON. Aim-oriented Feedback-driven Auto-Tuning ========================================= Fortunately, the appropriate length of the aggregation interval can be inferred using a feedback loop. If the current snapshots are not showing much interesting information, in other words, if they show only rare access events, increasing the aggregation interval helps, and vice versa. We tested this theory on a few real-world workloads, and documented one of the experiences with an official DAMON monitoring intervals tuning guideline. Since it is a simple theory that requires repeated tries, it can be a good job for machines. Based on the guideline's theory, we design an automation of aggregation interval tuning, in a way similar to that of the camera auto-exposure feature. It defines the amount of interesting information as the ratio of the access events that DAMON actually observed to the theoretical maximum amount of them within each snapshot. Events are accounted in byte and sampling attempts granularity. For example, let's say there is a region of 'X' bytes size. DAMON tried access check sampling for the region 'Y' times in total for a given aggregation. Among the 'Y' attempts, 'Z' times it showed positive results. Then, the theoretical maximum number of access events for the region is 'X * Y'. And the number of access events that DAMON has observed for the region is 'X * Z'. The amount of the interesting information is '(X * Z) / (X * Y)'. 
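As a rough illustration of the ratio defined above, the following Python sketch computes it for a snapshot. The names `Region` and `observed_access_ratio` are hypothetical and are not part of the DAMON kernel API or the damo tool. Summing the per-region numerators and denominators to get one snapshot-wide ratio is an assumption made for this example; the commit message only defines the per-region quantities.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Region:
    """Hypothetical stand-in for one monitored region in a snapshot."""
    size_bytes: int    # 'X' in the text
    nr_positive: int   # 'Z': sampling checks that found an access


def observed_access_ratio(regions: Iterable[Region], nr_samples: int) -> float:
    """Return observed / theoretical-maximum access events.

    nr_samples is 'Y', the number of access checks per region within
    one aggregation interval. Events are counted in byte * sample
    units, so a region contributes X * Z observed events out of a
    maximum of X * Y.
    """
    regions = list(regions)
    observed = sum(r.size_bytes * r.nr_positive for r in regions)
    maximum = sum(r.size_bytes * nr_samples for r in regions)
    if maximum <= 0:
        # No samples or no memory: nothing could have been observed.
        return 0.0
    return observed / maximum


if __name__ == "__main__":
    snapshot = [Region(4096, 5), Region(3 * 4096, 0)]
    print(observed_access_ratio(snapshot, nr_samples=20))
```

For a single region the size cancels out and the ratio is simply Z / Y; the size only matters as a weight once several regions are combined.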
Note that each snapshot would have multiple regions. Users can set an arbitrary value of the ratio as their target. Once the target is set, the automation periodically measures the current value of the ratio and increases or decreases the aggregation interval if the ratio value is lower or higher than the target. The amount of the change is proportional to the distance between the current and the target values. To keep the auto-tuning from going too far, users can set the minimum and the maximum aggregation interval times. Changing only the aggregation interval while the sampling interval is kept makes the maximum level of access frequency in each snapshot, or discernment of regions, inconsistent. Also, an unnecessarily short sampling interval causes meaningless monitoring overhead. The automation therefore adjusts the sampling interval together with the aggregation interval, while keeping the ratio between the two intervals. Users can set the ratio, or the discernment. Discussion ========== The modified question (aimed amount of access events, or lights, in each snapshot) is easy to answer for both the users and the kernel. If users are interested in finding more cold regions, the value should be lower, and vice versa. If users have no idea, the kernel can suggest a fair default value based on some theories and experiments. For example, based on the Pareto principle (80/20 rule), we could expect a 20% target ratio will capture 80% of real access events. Since 80% might be too high, applying the rule once again, 4% (20% * 20%) may capture about 64% (80% * 80%) of real access events. The sampling to aggregation intervals ratio and min/max aggregation intervals are also arguably easy to answer. What users want is discernment of regions for efficient system operation, for example, X amount of colder regions or Y amount of warmer regions, not exactly how many times each cache line is accessed in nanoseconds degree. 
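The feedback loop described above (measure the ratio, change the aggregation interval in proportion to the distance from the target, clamp it to the user-given range, and keep the sampling-to-aggregation ratio fixed) could be sketched as follows. This is only an illustration under stated assumptions: `tune_intervals` is a hypothetical name, and the specific proportional update `aggr * (1 + (target - current) / target)` is an assumed formula, since the commit message says only that the change is proportional to the distance. It is not the kernel's implementation.

```python
from typing import Tuple


def tune_intervals(aggr_us: float,
                   current_ratio: float,
                   target_ratio: float,
                   min_aggr_us: float,
                   max_aggr_us: float,
                   sample_to_aggr: float = 0.05) -> Tuple[float, float]:
    """Return (new_sampling_us, new_aggregation_us).

    Assumed update rule (hypothetical): scale the aggregation interval
    by the relative distance between the target and the current ratio.
    Below target, the interval grows; above target, it shrinks.
    """
    if target_ratio <= 0:
        raise ValueError("target_ratio must be positive")

    # Relative distance to the target; positive when too few events are seen.
    error = (target_ratio - current_ratio) / target_ratio
    new_aggr = aggr_us * (1.0 + error)

    # Keep the result inside the user-given aggregation interval range.
    new_aggr = min(max(new_aggr, min_aggr_us), max_aggr_us)

    # The sampling interval follows, so the ratio between the two is kept.
    return new_aggr * sample_to_aggr, new_aggr


if __name__ == "__main__":
    # 100 ms aggregation interval, observing 2% against a 4% target.
    print(tune_intervals(100_000, 0.02, 0.04, 100_000, 400_000_000))
```

With the 0.05 ratio mentioned in the text, a 100 millisecond to 400 second aggregation range corresponds to a 5 millisecond to 20 second sampling range, which is why only one pair of bounds needs to be passed in.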
The appropriate min/max aggregation interval can be set relatively naively, and may be better set for the aimed monitoring overhead. Since the sampling interval directly decides the overhead, setting it based on the sampling interval can be easy. With my experiences, I'd argue the intervals ratio 0.05, and 5 milliseconds to 20 seconds sampling interval range (100 milliseconds to 400 seconds aggregation interval) can be a good default suggestion. Evaluation ========== On a machine running a real world server workload, I ran DAMON to monitor its physical address space for about 23 hours, with this feature turned on. I set it to tune the sampling interval in a range from 5 milliseconds to 10 seconds, aiming for a 4 % DAMON-observed access ratio per three aggregation intervals. The exact command I used is as below. damo start --monitoring_intervals_goal 4% 3 5ms 10s --damos_action stat During the test run, DAMON continuously updated sampling and aggregation intervals as designed, within the given range. For all the time, DAMON was able to find the intervals that meet the target access events ratio in the given intervals range (sampling interval between 5 milliseconds and 10 seconds). For most of the time, the tuned sampling interval converged to 300-400 milliseconds. It made only a small amount of changes within the range. The average of the tuned sampling interval during the test was about 380 milliseconds. The workload periodically gets less load and decreases its CPU usage. Presumably this also caused it to make fewer memory access events. Reacting to such events, DAMON also increased the intervals as expected. It was still able to find the optimum interval that satisfies the target access ratio within the given intervals range. Usually it converged to about 5 seconds. Once the workload gets a normal amount of load again, DAMON reactively reduced the intervals to the normal range. I collected and visualized DAMON's monitoring results on the server a few times. 
Every time, the visualized access pattern looked not biased to only cold or hot pages but diverse and balanced. Let me show some of the snapshots that I collected near the end of the test (after about 23 hours had passed since starting DAMON on the server). The recency histogram looks as below. Please note that this visualization shows only very coarse-grained information. For more details about the visualization format, please refer to the DAMON user-space tool documentation[1].

    # ./damo report access --style recency-sz-hist --tried_regions_of 0 0 0 --access_rate 0 0
    <last accessed time (us)> <total size>
    [-19 h 7 m 45.514 s, -17 h 12 m 58.963 s)   6.198 GiB   |****                |
    [-17 h 12 m 58.963 s, -15 h 18 m 12.412 s)  0 B         |                    |
    [-15 h 18 m 12.412 s, -13 h 23 m 25.860 s)  0 B         |                    |
    [-13 h 23 m 25.860 s, -11 h 28 m 39.309 s)  0 B         |                    |
    [-11 h 28 m 39.309 s, -9 h 33 m 52.757 s)   0 B         |                    |
    [-9 h 33 m 52.757 s, -7 h 39 m 6.206 s)     0 B         |                    |
    [-7 h 39 m 6.206 s, -5 h 44 m 19.654 s)     0 B         |                    |
    [-5 h 44 m 19.654 s, -3 h 49 m 33.103 s)    0 B         |                    |
    [-3 h 49 m 33.103 s, -1 h 54 m 46.551 s)    0 B         |                    |
    [-1 h 54 m 46.551 s, -0 ns)                 16.967 GiB  |*********           |
    [-0 ns, --6886551440000 ns)                 38.835 GiB  |********************|
    memory bw estimate: 9.425 GiB per second  total size: 62.000 GiB

It shows about 38 GiB of memory was accessed at least once within the last aggregation interval (given the ~300 milliseconds tuned sampling interval, this is about six seconds). This is about 61 % of the total memory. In other words, DAMON found the warmest 61 % memory of the system. The number is particularly interesting given our Pareto principle based theory for the tuning goal value. We set it as 20 % of 20 % (4 %), thinking it would capture 80 % of 80 % (64 %) real access events. And it found 61 % hot memory, or working set. Nevertheless, to make the theory clearer, much more discussion and tests would be needed. At the moment, nonetheless, we can say making the target value higher helps finding more hot memory regions. 
The histogram also shows an amount of cold memory. About 17 GiB memory of the system has not been accessed at least for the last aggregation interval (about six seconds), and at most for about the last two hours. The real longest unaccessed time of the 17 GiB memory was about 19 minutes, though. This is a limitation of this visualization format. It further found very cold 6 GiB memory. It has not been accessed at least for the last 17 hours and at most 19 hours. What about hot memory distribution? To see this, I capture and visualize the snapshot in an access temperature histogram. Again, please refer to the DAMON user-space tool documentation[1] for the format and what access temperature means. Both the visualization and the metric show only very coarse-grained and limited information. The resulting histogram looks like below.

    # ./damo report access --style temperature-sz-hist --tried_regions_of 0 0 0
    <temperature> <total size>
    [-6,840,763,776,000, -5,501,580,939,800)  6.198 GiB   |***                 |
    [-5,501,580,939,800, -4,162,398,103,600)  0 B         |                    |
    [-4,162,398,103,600, -2,823,215,267,400)  0 B         |                    |
    [-2,823,215,267,400, -1,484,032,431,200)  0 B         |                    |
    [-1,484,032,431,200, -144,849,595,000)    0 B         |                    |
    [-144,849,595,000, 1,194,333,241,200)     55.802 GiB  |********************|
    [1,194,333,241,200, 2,533,516,077,400)    4.000 KiB   |*                   |
    [2,533,516,077,400, 3,872,698,913,600)    4.000 KiB   |*                   |
    [3,872,698,913,600, 5,211,881,749,800)    8.000 KiB   |*                   |
    [5,211,881,749,800, 6,551,064,586,000)    12.000 KiB  |*                   |
    [6,551,064,586,000, 7,890,247,422,200)    4.000 KiB   |*                   |
    memory bw estimate: 5.178 GiB per second  total size: 62.000 GiB

We can see most of the memory is in a similar access temperature range, and definitely some pages are extremely hot. To see the picture in more detail, let's capture and visualize the snapshot per DAMON-region, sorted by their access temperature. The total number of the regions was about 300. Due to the limited space, I'm showing only a few parts of the output here. 
    # ./damo report access --style hot --tried_regions_of 0 0 0
    heatmap: 00000000888888889999999888888888888888888888888888888888888888888888888888888888
    # min/max temperatures: -6,827,258,184,000, 17,589,052,500, column size: 793.600 MiB
    |999999999999999999999999999999999999999| 4.000 KiB access 100 % 18 h 9 m 43.918 s
    |999999999999999999999999999999999999999| 8.000 KiB access 100 % 17 h 56 m 5.351 s
    |999999999999999999999999999999999999999| 4.000 KiB access 100 % 15 h 24 m 19.634 s
    |999999999999999999999999999999999999999| 4.000 KiB access 100 % 14 h 10 m 55.606 s
    |999999999999999999999999999999999999999| 4.000 KiB access 100 % 11 h 34 m 18.993 s
    [...]
    |99999999999999999999999999999| 8.000 KiB access 100 % 1 m 27.945 s
    |11111111111111111111111111111| 80.000 KiB access 15 % 1 m 21.180 s
    |00000000000000000000000000000| 24.000 KiB access 5 % 1 m 21.180 s
    |00000000000000000000000000000| 5.919 GiB access 10 % 1 m 14.415 s
    |99999999999999999999999999999| 12.000 KiB access 100 % 1 m 7.650 s
    [...]
    |0| 4.000 KiB access 5 % 0 ns
    |0| 12.000 KiB access 5 % 0 ns
    |0| 188.000 KiB access 0 % 0 ns
    |0| 24.000 KiB access 0 % 0 ns
    |0| 48.000 KiB access 0 % 0 ns
    [...]
    |0000000000000000000000000000000| 8.000 KiB access 0 % 6 m 45.901 s
    |00000000000000000000000000000000| 36.000 KiB access 0 % 7 m 26.491 s
    |00000000000000000000000000000000| 4.000 KiB access 0 % 12 m 37.682 s
    |000000000000000000000000000000000| 8.000 KiB access 0 % 18 m 9.168 s
    |000000000000000000000000000000000| 16.000 KiB access 0 % 19 m 3.288 s
    |0000000000000000000000000000000000000000| 6.198 GiB access 0 % 18 h 57 m 52.582 s
    memory bw estimate: 8.798 GiB per second  total size: 62.000 GiB

We can see DAMON found small and extremely hot regions that were accessed for all access check sampling (once per about 300 milliseconds) for more than 10 hours. The access temperature rapidly decreases. DAMON was also able to find small and big regions that were not accessed for up to about 19 minutes. 
It even found an outlier cold region of 6 GiB that was not accessed for about 19 hours. It is unclear what the outlier region is, as of this writing. For the testing, DAMON was consuming about 0.1% of a single CPU's time. This is again an expected result, since DAMON was using about a 370 milliseconds sampling interval in most cases.

    # ps -p $kdamond_pid -o %cpu
    %CPU
     0.1

I also ran similar tests against a kernel build workload and an in-memory cache workload benchmark[2]. Detailed results including tuned intervals and captured access patterns were of course different since those depend on the workloads. But the auto-tuning feature was always working as expected like the above results for the real world workload. To wrap up, with the intervals auto-tuning feature, DAMON was able to capture good-quality access pattern snapshots on a real world server workload. The auto-tuning feature was able to adaptively react to the dynamic access patterns of the workload and reliably provide consistent monitoring results without manual human interventions. Also, the auto-tuning made DAMON consume only the necessary amount of resource for the required quality. References ========== [1] https://github.com/damonitor/damo/blob/next/USAGE.md#access-report-styles [2] https://github.com/facebookresearch/DCPerf/blob/main/packages/tao_bench/README.md This patch (of 8): Add data structures for DAMON sampling and aggregation intervals automatic tuning that aims for a specific amount of DAMON-observed access events per snapshot. In more detail, define the data structure for the tuning goal, link it to the monitoring attributes data structure so that DAMON kernel API callers can make the request, and update the DAMON parameters setup function to respect the new parameter. 
Link: https://lkml.kernel.org/r/20250303221726.484227-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250303221726.484227-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/mmu_notifier: use MMU_NOTIFY_CLEAR in remove_device_exclusive_entry()David Hildenbrand
Let's limit the use of MMU_NOTIFY_EXCLUSIVE to the case where we convert a present PTE to device-exclusive. For the other case, we can simply use MMU_NOTIFY_CLEAR, because it really is clearing the device-exclusive entry first, to then install the present entry. Update the documentation of MMU_NOTIFY_EXCLUSIVE, to document the single use case more thoroughly. If ever required, we could add a separate MMU_NOTIFY_CLEAR_EXCLUSIVE; for now using MMU_NOTIFY_CLEAR seems to be sufficient. Link: https://lkml.kernel.org/r/20250226132257.2826043-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16cpu: remove needless return in void API suspend_enable_secondary_cpus()Zijun Hu
Remove needless 'return' in void API suspend_enable_secondary_cpus() since both the API and thaw_secondary_cpus() are void functions. Link: https://lkml.kernel.org/r/20250221-rmv_return-v1-2-cc8dff275827@quicinc.com Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16rhashtable: remove needless return in three void APIsZijun Hu
Remove needless 'return' in the following void APIs: rhltable_walk_enter() rhltable_free_and_destroy() rhltable_destroy() Since both the API and callee involved are void functions. Link: https://lkml.kernel.org/r/20250221-rmv_return-v1-16-cc8dff275827@quicinc.com Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16lib min_heap: use size_t for array size and index variablesKuan-Wei Chiu
Replace the int type with size_t for variables representing array sizes and indices in the min-heap implementation. Using size_t aligns with standard practices for size-related variables and avoids potential issues on platforms where int may be insufficient to represent all valid sizes or indices. Link: https://lkml.kernel.org/r/20250215165618.1757219-1-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Yu-Chun Lin <eleanor15x@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16reboot: retire hw_protection_reboot and hw_protection_shutdown helpersAhmad Fatoum
The hw_protection_reboot and hw_protection_shutdown functions mix mechanism with policy: They let the driver requesting an emergency action for hardware protection also decide how to deal with it. This is inadequate in the general case as a driver reporting e.g. an imminent power failure can't know whether a shutdown or a reboot would be more appropriate for a given hardware platform. With the addition of the hw_protection parameter, it's now possible to configure at runtime the default emergency action and drivers are expected to use hw_protection_trigger to have this parameter dictate policy. As no current users of either the hw_protection_reboot or hw_protection_shutdown helpers remain, remove them, so as not to tempt driver authors to call them. Existing users now either defer to hw_protection_trigger or call __hw_protection_trigger with a suitable argument directly when they have inside knowledge on whether a reboot or shutdown would be more appropriate. Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-12-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16reboot: add support for configuring emergency hardware protection actionAhmad Fatoum
We currently leave the decision of whether to shutdown or reboot to protect hardware in an emergency situation to the individual drivers. This works out in some cases, where the driver detecting the critical failure has inside knowledge: It binds to the system management controller for example or is guided by hardware description that defines what to do. In the general case, however, the driver detecting the issue can't know what the appropriate course of action is and shouldn't be dictating the policy of dealing with it. Therefore, add a global hw_protection toggle that allows the user to specify whether shutdown or reboot should be the default action when the driver doesn't set policy. This introduces no functional change yet as hw_protection_trigger() has no callers, but these will be added in subsequent commits. [arnd@arndb.de: hide unused hw_protection_attr] Link: https://lkml.kernel.org/r/20250224141849.1546019-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-7-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16reboot: rename now misleading __hw_protection_shutdown symbolsAhmad Fatoum
The __hw_protection_shutdown function name has become misleading since it can cause either a shutdown (poweroff) or a reboot depending on its argument. To avoid further confusion, let's rename it, so it doesn't suggest that a poweroff is all it can do. Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-5-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16reboot: replace __hw_protection_shutdown bool action parameter with an enumAhmad Fatoum
Patch series "reboot: support runtime configuration of emergency hw_protection action", v3. We currently leave the decision of whether to shutdown or reboot to protect hardware in an emergency situation to the individual drivers. This works out in some cases, where the driver detecting the critical failure has inside knowledge: It binds to the system management controller for example or is guided by hardware description that defines what to do. This is inadequate in the general case though as a driver reporting e.g. an imminent power failure can't know whether a shutdown or a reboot would be more appropriate for a given hardware platform. To address this, this series adds a hw_protection kernel parameter and sysfs toggle that can be used to change the action from the shutdown default to reboot. A new hw_protection_trigger API then makes use of this default action. My particular use case is unattended embedded systems that don't have support for shutdown and that power on automatically when power is supplied:

- A brief power cycle gets detected by the driver
- The kernel powers down the system and SoC goes into shutdown mode
- Power is restored
- The system remains oblivious to the restored power
- System needs to be manually power cycled for a duration long enough to drain the capacitors

With this series, such systems can configure the kernel with hw_protection=reboot to have the boot firmware worry about critical conditions. This patch (of 12): Currently __hw_protection_shutdown() either reboots or shuts down the system according to its shutdown argument. To make the logic easier to follow, both inside __hw_protection_shutdown and at caller sites, let's replace the bool parameter with an enum. This will be extra useful when, in a later commit, a third action is added to the enumeration. No functional change. 
Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-0-e1c09b090c0c@pengutronix.de Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-1-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16ucount: use rcuref_t for reference countingSebastian Andrzej Siewior
Use rcuref_t for reference counting. This eliminates the cmpxchg loop in the get and put path. This also eliminates the need to acquire the lock in the put path because once the final user returns the reference, it can no longer be obtained. Link: https://lkml.kernel.org/r/20250203150525.456525-5-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mengen Sun <mengensun@tencent.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: YueHong Wu <yuehongwu@tencent.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16ucount: use RCU for ucounts lookupsSebastian Andrzej Siewior
The ucounts element is looked up under ucounts_lock. This can be optimized by using RCU for a lockless lookup and returning an element if the reference can be obtained. Replace hlist_head with hlist_nulls_head which is RCU compatible. Let find_ucounts() search for the required item within an RCU section and return the item if a reference could be obtained. This means alloc_ucounts() will always return an element (unless the memory allocation failed). Let put_ucounts() RCU free the element if the reference counter dropped to zero. Link: https://lkml.kernel.org/r/20250203150525.456525-4-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mengen Sun <mengensun@tencent.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: YueHong Wu <yuehongwu@tencent.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16rcu: provide a static initializer for hlist_nulls_headSebastian Andrzej Siewior
Patch series "ucount: Simplify refcounting with rcuref_t". I noticed that the atomic_dec_and_lock_irqsave() in put_ucounts() loops sometimes even during boot. Something like 2-3 iterations but still. This series replaces the refcounting with rcuref_t and adds a RCU lookup. This allows a lockless lookup in alloc_ucounts() if the entry is available and a cmpxchg()less put of the item. This patch (of 4): Provide a static initializer for hlist_nulls_head so that it can be used in statically defined data structures. Link: https://lkml.kernel.org/r/20250203150525.456525-1-bigeasy@linutronix.de Link: https://lkml.kernel.org/r/20250203150525.456525-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mengen Sun <mengensun@tencent.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: YueHong Wu <yuehongwu@tencent.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16crash: let arch decide usable memory range in reserved areaSourabh Jain
Although the crashkernel area is reserved, on architectures like PowerPC, it is possible for the crashkernel reserved area to contain components like RTAS, TCE, OPAL, etc. To avoid placing kexec segments over these components, PowerPC has its own set of APIs to locate holes in the crashkernel reserved area. Add an arch hook in the generic locate mem hole APIs so that architectures can handle such special regions in the crashkernel area while locating memory holes for kexec segments using generic APIs. With this, a lot of redundant arch-specific code can be removed, as it performs the exact same job as the generic APIs. To keep the generic and arch-specific changes separate, the changes related to moving PowerPC to use the generic APIs and the removal of PowerPC-specific APIs for memory hole allocation are done in a subsequent patch titled "powerpc/crash: Use generic APIs to locate memory hole for kdump". Link: https://lkml.kernel.org/r/20250131113830.925179-4-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16crash: remove an unused argument from reserve_crashkernel_generic()Sourabh Jain
cmdline argument is not used in reserve_crashkernel_generic() so remove it. Correspondingly, all the callers have been updated as well. No functional change intended. Link: https://lkml.kernel.org/r/20250131113830.925179-3-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm, swap: simplify folio swap allocationKairui Song
With slot cache gone, clean up the allocation helpers even more. folio_alloc_swap will be the only entry for allocation and adding the folio to swap cache (except suspend), making it opposite of folio_free_swap. Link: https://lkml.kernel.org/r/20250313165935.63303-8-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm, swap: remove swap slot cacheKairui Song
Slot cache is no longer needed now, so remove it and all related code.

- vm-scalability with: `usemem --init-time -O -y -x -R -31 1G`, 12G memory cgroup using simulated pmem as SWAP (32G pmem, 32 CPUs), 16 test runs for each case, measuring the total throughput:

                          Before (KB/s) (stdev)   After (KB/s) (stdev)
  Random (4K):            424907.60 (24410.78)    414745.92 (34554.78)
  Random (64K):           163308.82 (11635.72)    167314.50 (18434.99)
  Sequential (4K, !-R):   6150056.79 (103205.90)  6321469.06 (115878.16)

  The performance changes are below noise level.

- Build linux kernel with make -j96, using 4K folio with 1.5G memory cgroup limit and 64K folio with 2G memory cgroup limit, on top of tmpfs, 12 test runs, measuring the system time:

                      Before (s) (stdev)  After (s) (stdev)
  make -j96 (4K):     6445.69 (61.95)     6408.80 (69.46)
  make -j96 (64K):    6841.71 (409.04)    6437.99 (435.55)

  Similar to above, 64k mTHP case showed a slight improvement.

Link: https://lkml.kernel.org/r/20250313165935.63303-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm, swap: use percpu cluster as allocation fast pathKairui Song
The current allocation workflow first traverses the plist with a global lock held and, after choosing a device, uses the percpu cluster on that swap device. This commit moves the percpu cluster variable out of being tied to individual swap devices, making it a global percpu variable that is used directly for allocation as a fast path. The global percpu cluster variable will never point to an HDD device, and allocations on an HDD device are still globally serialized. This improves the allocator performance and prepares for removal of the slot cache in later commits.

There shouldn't be much observable behavior change, except one thing: this changes how swap device allocation rotation works. Currently, each allocation rotates the plist, and because of the existence of the slot cache (one order 0 allocation usually returns 64 entries), swap devices of the same priority are rotated for every 64 order 0 entries consumed. High-order allocations are different: they bypass the slot cache, so the swap device is rotated for every 16K, 32K, or up to 2M allocation. The rotation rule was never clearly defined or documented, and it was changed several times without mention.

After this commit, and once the slot cache is gone in later commits, swap device rotation will happen for every consumed cluster. Ideally, non-HDD devices will be rotated once 2M of space has been consumed for each order. Fragmented clusters will rotate the device faster, which seems OK. HDD devices are rotated for every allocation regardless of the allocation order, which should be OK too and is trivial.

This commit also slightly changes allocation behaviour for the slot cache. The newly added cluster allocation fast path may allocate entries from a different device to the slot cache. This is not observable from user space and only impacts performance very slightly, and the slot cache will be gone in the next commit, so this can be ignored.
Link: https://lkml.kernel.org/r/20250313165935.63303-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
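The rotation rule described in the commit above (rotate to the next same-priority device once a cluster has been consumed for a given order) can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel implementation: `rotation_model`, `model_alloc`, the 512-slot cluster (2M with 4K pages) and the per-order bookkeeping are all hypothetical names and simplifications, and fragmentation and HDD handling are ignored.

```c
#include <assert.h>

#define SLOTS_PER_CLUSTER 512   /* assumption: 2M cluster / 4K pages */
#define MODEL_NR_ORDERS   10    /* orders 0..9, i.e. 4K up to 2M */

/* Hypothetical model state: one "current device" per allocation order. */
struct rotation_model {
    int nr_devices;                 /* same-priority non-HDD devices */
    int current[MODEL_NR_ORDERS];   /* device currently used per order */
    int used[MODEL_NR_ORDERS];      /* slots consumed in current cluster */
};

/* Return the device that serves this allocation, then rotate to the
 * next device if the allocation finished off the current cluster. */
static int model_alloc(struct rotation_model *m, int order)
{
    assert(order >= 0 && order < MODEL_NR_ORDERS);
    assert(m->nr_devices > 0);

    int dev = m->current[order];

    m->used[order] += 1 << order;
    if (m->used[order] >= SLOTS_PER_CLUSTER) {
        m->used[order] = 0;
        m->current[order] = (dev + 1) % m->nr_devices;
    }
    return dev;
}
```

In this model, with two devices, 512 consecutive order 0 allocations land on the first device before the next one moves to the second, while every order 9 (2M) allocation consumes a whole cluster and therefore rotates immediately.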
2025-03-16zsmalloc: introduce new object mapping APISergey Senozhatsky
The current object mapping API is a little cumbersome. First, it's inconsistent: sometimes it returns with page-faults disabled and sometimes with page-faults enabled. Second, and most importantly, it enforces atomicity restrictions on its users. zs_map_object() has to return a linear object address, which is not always possible because some objects span multiple physical (non-contiguous) pages. For such objects zsmalloc uses a per-CPU buffer to which the object's data is copied before a pointer to that per-CPU buffer is returned back to the caller. This leads to another, final, issue: an extra memcpy(). Since the caller gets a pointer to the per-CPU buffer it can memcpy() data only to that buffer, and during zs_unmap_object() zsmalloc will memcpy() from that per-CPU buffer to the physical pages that the object in question spans across.

The new API splits functions by access mode:

- zs_obj_read_begin(handle, local_copy)
  Returns a pointer to handle memory. For objects that span two physical pages a local_copy buffer is used to store the object's data before the address is returned to the caller. Otherwise the object's page is kmap_local mapped directly.

- zs_obj_read_end(handle, buf)
  Unmaps the page if it was kmap_local mapped by zs_obj_read_begin().

- zs_obj_write(handle, buf, len)
  Copies len-bytes from compression buffer to handle memory (takes care of objects that span two pages). This does not need any additional (e.g. per-CPU) buffers and writes the data directly to zsmalloc pool pages.

In terms of performance, on a synthetic and completely reproducible test that allocates a fixed number of objects of fixed sizes and iterates over those objects, first mapping in RO then in RW mode:

OLD API
=======
3 first results out of 10

  369,205,778 instructions   # 0.80 insn per cycle
   40,467,926 branches       # 113.732 M/sec

  369,002,122 instructions   # 0.62 insn per cycle
   40,426,145 branches       # 189.361 M/sec

  369,036,706 instructions   # 0.63 insn per cycle
   40,430,860 branches       # 204.105 M/sec
  [..]

NEW API
=======
3 first results out of 10

  265,799,293 instructions   # 0.51 insn per cycle
   29,834,567 branches       # 170.281 M/sec

  265,765,970 instructions   # 0.55 insn per cycle
   29,829,019 branches       # 161.602 M/sec

  265,764,702 instructions   # 0.51 insn per cycle
   29,828,015 branches       # 189.677 M/sec
  [..]

T-test on all 10 runs
=====================
Difference at 95.0% confidence
  -1.03219e+08 +/- 55308.7
  -27.9705% +/- 0.0149878%
  (Student's t, pooled s = 58864.4)

The old API will stay around until the remaining users switch to the new one. After that we'll also remove zsmalloc per-CPU buffer and CPU hotplug handling. The split of map(RO) and map(WO) into read_{begin/end}/write is suggested by Yosry Ahmed. Link: https://lkml.kernel.org/r/20250303022425.285971-15-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
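The read/write split described above can be sketched in plain userspace C. This is an illustrative model only, not the zsmalloc code: `model_obj`, `model_read_begin`, `model_read_end`, `model_write` and the 64-byte `MODEL_PAGE_SIZE` are hypothetical, the two "pages" are ordinary arrays, and the kmap_local mapping the kernel would perform is reduced to returning a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 64  /* tiny "page" so spanning is easy to show */

/* Hypothetical object: starts at pages[0] + off, may spill into pages[1]. */
struct model_obj {
    unsigned char *pages[2];  /* two non-contiguous backing pages */
    size_t off;               /* start offset inside pages[0] */
    size_t len;               /* object size in bytes */
};

static int model_spans(const struct model_obj *o)
{
    return o->off + o->len > MODEL_PAGE_SIZE;
}

/* Read side: return a linear view of the object. Only objects that
 * span two pages are copied into the caller-provided local_copy. */
static const void *model_read_begin(const struct model_obj *o, void *local_copy)
{
    assert(o->off < MODEL_PAGE_SIZE);
    if (!model_spans(o))
        return o->pages[0] + o->off;  /* kernel: kmap_local the page */

    size_t first = MODEL_PAGE_SIZE - o->off;
    memcpy(local_copy, o->pages[0] + o->off, first);
    memcpy((unsigned char *)local_copy + first, o->pages[1], o->len - first);
    return local_copy;
}

/* Kernel: unmap if read_begin mapped directly. Nothing to do here. */
static void model_read_end(const struct model_obj *o, const void *buf)
{
    (void)o;
    (void)buf;
}

/* Write side: copy straight into the backing pages, no bounce buffer. */
static void model_write(struct model_obj *o, const void *buf, size_t len)
{
    assert(o->off < MODEL_PAGE_SIZE && len <= o->len);

    size_t first = MODEL_PAGE_SIZE - o->off;
    if (len <= first) {
        memcpy(o->pages[0] + o->off, buf, len);
        return;
    }
    memcpy(o->pages[0] + o->off, buf, first);
    memcpy(o->pages[1], (const unsigned char *)buf + first, len - first);
}
```

The point of the model is that the write path never touches an intermediate buffer, and the read path copies only in the page-spanning case, which matches the reduction in work the commit message reports.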
2025-03-16mm/folio_queue: delete __folio_order and use folio_order directlyLiu Ye
__folio_order is the same as folio_order, remove __folio_order and then just include mm.h and use folio_order directly. Link: https://lkml.kernel.org/r/20250212025843.80283-2-liuye@kylinos.cn Signed-off-by: Liu Ye <liuye@kylinos.cn> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/damon: implement a new DAMOS filter type for unmapped pagesSeongJae Park
Patch series "mm/damon: introduce DAMOS filter type for unmapped pages". Users decide whether their memory will be mapped or unmapped. It implies that the two types of memory can have different characteristics and management requirements. Provide the DAMON-observability DAMOS-operation capability for the different types by introducing a new DAMOS filter type for unmapped pages. This patch (of 2): Implement yet another DAMOS filter type for unmapped pages on the DAMON kernel API, and add support for it in the physical address space DAMON operations set (paddr). Since it is only for unmapped pages, support from the virtual address spaces DAMON operations set (vaddr) is not required. Link: https://lkml.kernel.org/r/20250219220146.133650-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250219220146.133650-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: enable bootmem allocation from CMA areasFrank van der Linden
If hugetlb_cma_only is enabled, we know that hugetlb pages can only be allocated from CMA. Now that there is an interface to do early reservations from a CMA area (returning memblock memory), it can be used to allocate hugetlb pages from CMA. This also allows for doing pre-HVO on these pages (if enabled). Make sure to initialize the page structures and associated data correctly. Create a flag to signal that a hugetlb page has been allocated from CMA to make things a little easier. Some configurations of powerpc have a special hugetlb bootmem allocator, so introduce a boolean arch_specific_huge_bootmem_alloc that returns true if such an allocator is present. In that case, CMA bootmem allocations can't be used, so check that function before trying. Link: https://lkml.kernel.org/r/20250228182928.2645936-27-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/cma: introduce a cma validate functionFrank van der Linden
Define a function to check if a CMA area is valid, which means that its ranges do not cross any zone boundaries. Store the result in the newly created flags for each CMA area, so that repeated calls do not redo the check. This allows for checking the validity of a CMA area early, which is needed later in order to be able to allocate hugetlb bootmem pages from it with pre-HVO. Link: https://lkml.kernel.org/r/20250228182928.2645936-24-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
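A validity check of this shape (every range must sit inside a single zone, with the verdict cached in per-area flags) might look roughly like the following userspace sketch. All names here (`model_cma`, `model_zone`, `model_range`, `MODEL_CMA_CHECKED`, `MODEL_CMA_VALID`, `model_cma_validate`) are hypothetical and are not the identifiers used by the patch; zones are modelled as half-open PFN intervals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_zone  { unsigned long start_pfn, end_pfn; };  /* [start, end) */
struct model_range { unsigned long base_pfn, count; };

/* Hypothetical flag bits caching the outcome of the check. */
enum {
    MODEL_CMA_CHECKED = 1u << 0,  /* validation already ran */
    MODEL_CMA_VALID   = 1u << 1,  /* ...and every range passed */
};

struct model_cma {
    const struct model_range *ranges;
    size_t nr_ranges;
    unsigned int flags;
};

/* A range is fine if some single zone fully contains it. */
static bool range_in_one_zone(const struct model_range *r,
                              const struct model_zone *zones, size_t nr_zones)
{
    unsigned long end = r->base_pfn + r->count;

    for (size_t i = 0; i < nr_zones; i++)
        if (r->base_pfn >= zones[i].start_pfn && end <= zones[i].end_pfn)
            return true;
    return false;
}

/* Validate once; later calls return the cached verdict from flags. */
static bool model_cma_validate(struct model_cma *cma,
                               const struct model_zone *zones, size_t nr_zones)
{
    assert(cma != NULL);
    if (cma->flags & MODEL_CMA_CHECKED)
        return (cma->flags & MODEL_CMA_VALID) != 0;

    bool ok = true;
    for (size_t i = 0; i < cma->nr_ranges && ok; i++)
        ok = range_in_one_zone(&cma->ranges[i], zones, nr_zones);

    cma->flags |= MODEL_CMA_CHECKED | (ok ? MODEL_CMA_VALID : 0u);
    return ok;
}
```

Caching the result in the flags is what makes repeated calls cheap and consistent: the second call does not look at the zones at all.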
2025-03-16mm/hugetlb: do pre-HVO for bootmem allocated pagesFrank van der Linden
For large systems, the overhead of vmemmap pages for hugetlb is substantial. It's about 1.5% of memory, which is about 45G for a 3T system. If you want to configure most of that system for hugetlb (e.g. to use as backing memory for VMs), there is a chance of running out of memory on boot, even though you know that the 45G will become available later. To avoid this scenario, and since it's a waste to first allocate and then free that 45G during boot, do pre-HVO for hugetlb bootmem allocated pages ('gigantic' pages). pre-HVO is done by adding functions that are called from sparse_init_nid_early and sparse_init_nid_late. The first is called before memmap allocation, so it takes care of allocating memmap HVO-style. The second verifies that all bootmem pages look good, specifically it checks that they do not intersect with multiple zones. This can only be done from sparse_init_nid_late path, when zones have been initialized. The hugetlb page size must be aligned to the section size, and aligned to the size of memory described by the number of page structures contained in one PMD (since pre-HVO is not prepared to split PMDs). This should be true for most 'gigantic' pages, it is for 1G pages on x86, where both of these alignment requirements are 128M. This will only have an effect if hugetlb_bootmem_alloc was called early in boot. If not, it won't do anything, and HVO for bootmem hugetlb pages works as before. 
Link: https://lkml.kernel.org/r/20250228182928.2645936-20-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
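The figures in the commit above can be reproduced with a few lines of arithmetic. The sketch below assumes values that the commit message does not state explicitly: a 64-byte page structure, 4K base pages, a 2M PMD mapping for the vmemmap, and a 128M section size as on x86-64. With those assumptions the overhead comes out at 64/4096, roughly 1.56%, which is 48G for exactly 3 TiB and in line with the "about 1.5%" and "about 45G" quoted above. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE        4096ULL        /* assumption: 4K base pages */
#define MODEL_STRUCT_PAGE_SIZE 64ULL          /* assumption: 64-byte struct page */
#define MODEL_PMD_SIZE         (2ULL << 20)   /* assumption: 2M PMD mapping */

/* Bytes of page structures needed to describe mem_bytes of memory. */
static uint64_t vmemmap_bytes(uint64_t mem_bytes)
{
    return mem_bytes / MODEL_PAGE_SIZE * MODEL_STRUCT_PAGE_SIZE;
}

/* Memory described by the page structures that fit in one PMD. */
static uint64_t mem_per_vmemmap_pmd(void)
{
    return MODEL_PMD_SIZE / MODEL_STRUCT_PAGE_SIZE * MODEL_PAGE_SIZE;
}

/* Both alignment requirements from the commit message: the huge page
 * size must be a multiple of the section size and of the memory
 * covered by one PMD worth of page structures. */
static int prehvo_size_ok(uint64_t hpage_size, uint64_t section_size)
{
    assert(section_size != 0);
    return hpage_size % section_size == 0 &&
           hpage_size % mem_per_vmemmap_pmd() == 0;
}
```

Under these assumptions one PMD of page structures covers 128M, so a 1G page satisfies both requirements while a 2M page does not, consistent with the statement that both alignments are 128M for 1G pages on x86.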
2025-03-16mm/hugetlb: add pre-HVO frameworkFrank van der Linden
Define flags for pre-HVOed bootmem hugetlb pages, and act on them. The most important flag is the HVO flag, signalling that a bootmem allocated gigantic page has already been HVO-ed. If this flag is seen by the hugetlb bootmem gather code, the page is marked as HVO optimized. The HVO code will then not try to optimize it again. Instead, it will just map the tail page mirror pages read-only, completing the HVO steps. No functional change, as nothing sets the flags yet. Link: https://lkml.kernel.org/r/20250228182928.2645936-18-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: deal with multiple calls to hugetlb_bootmem_allocFrank van der Linden
Architectures that want pre-HVO of hugetlb vmemmap pages will need to call hugetlb_bootmem_alloc from an earlier spot in boot (before sparse_init). To facilitate some architectures doing this, protect hugetlb_bootmem_alloc against multiple calls. Also provide a helper function to check if it's been called, so that the early HVO code, to be added later, can see if there is anything to do. Link: https://lkml.kernel.org/r/20250228182928.2645936-16-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
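Protecting an init routine against multiple calls, plus a helper that reports whether it has run, is a common pattern. A minimal userspace sketch follows; `model_hugetlb_bootmem_alloc`, `model_hugetlb_bootmem_allocated` and the counter are hypothetical stand-ins, and the real function's allocation work is replaced by a counter increment. Early boot is single-threaded, so the model uses no locking.

```c
#include <assert.h>
#include <stdbool.h>

static bool model_bootmem_done;  /* set once the allocation has run */
static int model_alloc_runs;     /* counts how often the real work ran */

/* Safe to call from several boot paths: only the first call does work. */
static void model_hugetlb_bootmem_alloc(void)
{
    if (model_bootmem_done)
        return;

    assert(model_alloc_runs == 0);  /* invariant: body runs at most once */
    model_alloc_runs++;             /* stand-in for the actual allocation */
    model_bootmem_done = true;
}

/* Helper for later code (e.g. early HVO) to see if there is work to do. */
static bool model_hugetlb_bootmem_allocated(void)
{
    return model_bootmem_done;
}
```

An architecture that calls the function early and generic code that calls it later can then coexist: the second call is a no-op, and the helper tells the early HVO code whether there is anything to process.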