linux.git - Linus' kernel tree

Age	Commit message	Author
2025-03-17	dt-bindings: cpufreq: cpufreq-qcom-hw: Add missing constraint for interrupt-names	Krzysztof Kozlowski
	When narrowing properties per variant, the 'interrupt-names' should have the same constraints as 'interrupts'. Add missing upper bound on the property. Fixes: e69003202434 ("dt-bindings: cpufreq: cpufreq-qcom-hw: Add QCM2290") Fixes: 7ae24e054f75 ("dt-bindings: cpufreq: cpufreq-qcom-hw: Sanitize data per compatible") Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Acked-by: Rob Herring (Arm) <robh@kernel.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2025-03-17	dt-bindings: cpufreq: cpufreq-qcom-hw: Add QCS8300 compatible	Imran Shaik
	Document compatible for cpufreq hardware on Qualcomm QCS8300 platform. Signed-off-by: Imran Shaik <quic_imrashai@quicinc.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2025-03-17	erofs: enable 48-bit layout support	Gao Xiang
	Both 48-bit block addressing and encoded extents are implemented, let's enable them formally. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095625.2623817-1-hsiangkao@linux.alibaba.com
2025-03-17	erofs: support unaligned encoded data	Gao Xiang
	We're almost there. It's straight-forward to adapt the current decompression subsystem to support unaligned encoded (compressed) data. Note that unaligned data is not encouraged because of worse I/O and caching efficiency unless the corresponding compressor doesn't support fixed-sized output compression natively like Zstd. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-10-hsiangkao@linux.alibaba.com
2025-03-17	erofs: implement encoded extent metadata	Gao Xiang
	Implement the extent metadata parsing described in the previous commit. For 16-byte and 32-byte extent records, currently it is just a trivial binary search without considering the last access footprint, but it can be optimized for better sequential performance later. Tail fragments are supported, but ztailpacking feature is not for simplicity. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-9-hsiangkao@linux.alibaba.com
2025-03-17	erofs: add encoded extent on-disk definition	Gao Xiang
	Previously, EROFS provided both (non-)compact compressed indexes to keep necessary hints for each logical block, enabling O(1) random indexing. This approach was originally designed for small compression units (e.g., 4KiB), where compressed data is strictly block-aligned via fixed-sized output compression.
	However, EROFS now supports big pclusters up to 1MiB and many users use large configurations to minimize image sizes. For such configurations, the total number of extents decreases significantly (e.g., only 1,024 extents for a 1GiB file using 1MiB pclusters), so runtime metadata overhead becomes negligible compared to data I/O and decoding costs.
	Additionally, some popular compression algorithms (mainly Zstd) still lack native fixed-sized output compression support (although it's planned by their authors). Instead of just waiting for compressor improvements, let's adopt byte-oriented extents, allowing these compressors to retain their current methods. For example, it speeds up Zstd compression a lot:
	Processor: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz * 96
	Dataset: enwik9
	Build time	Size	Type	Command Line
	3m52.339s	266653696	FO	-C524288 -zzstd,22
	3m48.549s	266174464	FO	-E48bit -C524288 -zzstd,22
	0m12.821s	272134144	FI	-E48bit -C1048576 --max-extent-bytes=1048576 -zzstd,22
	0m14.528s	248987648	FO	-C1048576 -zlzma,9
	0m14.605s	248504320	FO	-E48bit -C1048576 -zlzma,9
	Encoded extents are structured as an array of `struct z_erofs_extent`, sorted by logical address in ascending order:
	__le32 plen // encoded length, algorithm id and flags
	__le32 pstart_lo // physical offset LSB
	__le32 pstart_hi // physical offset MSB
	__le32 lstart_lo // logical offset
	__le32 lstart_hi // logical offset MSB
	..
	Note that prefixed reduced records can be used to minimize metadata for specific cases (e.g. lstart less than 32 bits, then 32 to 16 bytes). If the logical lengths of all encoded extents are the same, 4-byte (plen) and 8-byte (plen, pstart_lo) records can be used. Or, 16-byte (plen .. lstart_lo) and 32-byte full records have to be used instead.
	If 16-byte and 32-byte records are used, the total number of extents is kept in `struct z_erofs_map_header`, and binary search can be applied on them. Note that `eytzinger order` is not considered because data sequential access is important. If 4-byte records are used, 8-byte start physical offset is between `struct z_erofs_map_header` and the `plen` array.
	In addition, 64-bit physical offsets can be applied with new encoded extent format to match full 48-bit block addressing. Remove redundant comments around `struct z_erofs_lcluster_index` too.
	Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-8-hsiangkao@linux.alibaba.com
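The binary search over extents sorted by logical address, as described in the commit above, can be sketched in userspace C. This is only an illustration under stated assumptions: `struct demo_extent` and `demo_find_extent()` are hypothetical names, the struct is a decoded host-endian view rather than the on-disk `struct z_erofs_extent`, and the kernel's actual lookup code may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded (host-endian) view of one extent; NOT the on-disk layout. */
struct demo_extent {
	uint64_t lstart;	/* logical start offset */
	uint64_t pstart;	/* physical start offset */
	uint32_t plen;		/* encoded length */
};

/*
 * Return the index of the last extent whose lstart <= loff, or -1 if loff
 * lies before the first extent. `ext` must be sorted by lstart ascending.
 */
long demo_find_extent(const struct demo_extent *ext, size_t n, uint64_t loff)
{
	size_t lo = 0, hi = n;	/* search window is [lo, hi) */

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ext[mid].lstart <= loff)
			lo = mid + 1;	/* mid is a candidate; keep looking to the right */
		else
			hi = mid;
	}
	return (long)lo - 1;
}
```

With the figure quoted in the commit (1,024 extents for a 1GiB file using 1MiB pclusters), such a search needs on the order of ten comparisons per lookup.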
2025-03-17	erofs: initialize decompression early	Gao Xiang
	- Rename erofs_init_managed_cache() to z_erofs_init_super(); - Move the initialization of managed_pslots into z_erofs_init_super() too; - Move z_erofs_init_super() and packed inode preparation upwards, before the root inode initialization. Therefore, the root directory can also be compressible. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250317054840.3483000-1-hsiangkao@linux.alibaba.com
2025-03-17	cpufreq: Init cpufreq only for present CPUs	Jacky Bai
	for_each_possible_cpu() is currently used to initialize cpufreq. However, in cpu_dev_register_generic(), for_each_present_cpu() is used to register CPU devices, which means the CPU devices are only registered for present CPUs and not for all possible CPUs. With nosmp or maxcpus=0, only the boot CPU is present, which leads to cpufreq probe failure or deferred probe because no CPU device is available for CPUs that are not present. Change for_each_possible_cpu() to for_each_present_cpu() in the above cpufreq drivers to ensure it only registers cpufreq for CPUs that are actually present. Fixes: b0c69e1214bc ("drivers: base: Use present CPUs in GENERIC_CPU_DEVICES") Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Jacky Bai <ping.bai@nxp.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2025-03-16	ucount: use rcuref_t for reference counting	Sebastian Andrzej Siewior
	Use rcuref_t for reference counting. This eliminates the cmpxchg loop in the get and put path. This also eliminates the need to acquire the lock in the put path because once the final user returns the reference, it can no longer be obtained anymore. Use rcuref_t for reference counting. Link: https://lkml.kernel.org/r/20250203150525.456525-5-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mengen Sun <mengensun@tencent.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: YueHong Wu <yuehongwu@tencent.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	ucount: use RCU for ucounts lookups	Sebastian Andrzej Siewior
	The ucounts element is looked up under ucounts_lock. This can be optimized by using RCU for a lockless lookup and returning an element if the reference can be obtained. Replace hlist_head with hlist_nulls_head which is RCU compatible. Let find_ucounts() search for the required item within an RCU section and return the item if a reference could be obtained. This means alloc_ucounts() will always return an element (unless the memory allocation failed). Let put_ucounts() RCU free the element if the reference counter dropped to zero. Link: https://lkml.kernel.org/r/20250203150525.456525-4-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mengen Sun <mengensun@tencent.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: YueHong Wu <yuehongwu@tencent.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	ucount: replace get_ucounts_or_wrap() with atomic_inc_not_zero()	Sebastian Andrzej Siewior
	get_ucounts_or_wrap() increments the counter and if the counter is negative then it decrements it again in order to reset the previous increment. This statement can be replaced with atomic_inc_not_zero() to only increment the counter if it is not yet 0. This simplifies the get function because the put (if the get failed) can be removed. atomic_inc_not_zero() is implemented as a cmpxchg() loop which can be repeated several times if another get/put is performed in parallel. This will be optimized later. Increment the reference counter only if not yet dropped to zero. Link: https://lkml.kernel.org/r/20250203150525.456525-3-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mengen Sun <mengensun@tencent.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: YueHong Wu <yuehongwu@tencent.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
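The "increment only if not yet zero" pattern described in the commit above can be sketched with C11 atomics. This is a userspace illustration, not the kernel's atomic_inc_not_zero(); the name `demo_inc_not_zero()` is hypothetical. It shows why the operation is a compare-and-exchange loop that may retry when another thread changes the counter concurrently.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the count has not already dropped to zero. */
bool demo_inc_not_zero(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 0) {
		/*
		 * On failure, `old` is refreshed with the value currently
		 * stored in *count and the loop retries with it.
		 */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;	/* the object is already on its way to being freed */
}
```

Because a failed get never modifies the counter, the caller has nothing to undo, which is the simplification the commit message refers to.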
2025-03-16	rcu: provide a static initializer for hlist_nulls_head	Sebastian Andrzej Siewior
	Patch series "ucount: Simplify refcounting with rcuref_t". I noticed that the atomic_dec_and_lock_irqsave() in put_ucounts() loops sometimes even during boot. Something like 2-3 iterations but still. This series replaces the refcounting with rcuref_t and adds a RCU lookup. This allows a lockless lookup in alloc_ucounts() if the entry is available and a cmpxchg()less put of the item. This patch (of 4): Provide a static initializer for hlist_nulls_head so that it can be used in statically defined data structures. Link: https://lkml.kernel.org/r/20250203150525.456525-1-bigeasy@linutronix.de Link: https://lkml.kernel.org/r/20250203150525.456525-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mengen Sun <mengensun@tencent.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: YueHong Wu <yuehongwu@tencent.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	lib/zlib: drop EQUAL macro	Yury Norov
	The macro is prehistoric, and only exists to help those readers who don't know what memcmp() returns if memory areas differ. This is pretty well documented, so the macro looks excessive. Now that the only user of the macro depends on DEBUG_ZLIB config, GCC warns about unused macro if the library is built with W=2 against defconfig. So drop it for good. Link: https://lkml.kernel.org/r/20250205212933.68695-1-yury.norov@gmail.com Signed-off-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Cc: Heiko Carsten <heiko.carstens@de.ibm.com> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	get_maintainer: stop reporting subsystem status as maintainer role	Vlastimil Babka
	After introducing the --substatus option, we can stop adjusting the reported maintainer role by the subsystem's status. For compatibility with the --git-chief-penguins option, keep the "chief penguin" role. Link: https://lkml.kernel.org/r/20250203-b4-get_maintainer-v2-2-83ba008b491f@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Bryan O'Donoghue <bryan.odonoghue@linaro.org> Cc: Joe Perches <joe@perches.com> Cc: Kees Cook <kees@kernel.org> Cc: Ted Ts'o <tytso@mit.edu> Cc: Thorsten Leemhuis <linux@leemhuis.info> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	get_maintainer: add --substatus for reporting subsystem status	Vlastimil Babka
	Patch series "get_maintainer: report subsystem status separately", v2. The subsystem status (S: field) can inform a patch submitter if the subsystem is well maintained or e.g. maintainers are missing. In get_maintainer, it is currently reported with --role(stats) by adjusting the maintainer role for any status different from Maintained. This has two downsides: - if a subsystem has only reviewers or mailing lists and no maintainers, the status is not reported. For example Orphan subsystems typically have no maintainers so there's nobody to report as orphan minder. - the Supported status means that someone is paid for maintaining, but it is reported as "supporter" for all the maintainers, which can be incorrect (only some of them may be paid). People (including myself) have been also confused about what "supporter" means. The second point has been brought up in 2022 and the discussion in the end resulted in adjusting documentation only [1]. I however agree with Ted's points that it's misleading to take the subsystem status and apply it to all maintainers [2]. The attempt to modify get_maintainer output was retracted after Joe objected that the status becomes not reported at all [3]. This series addresses that concern by reporting the status (unless it's the most common Maintained one) on separate lines that follow the reported emails, using a new --substatus parameter. Care is taken to reduce the noise to minimum by not reporting the most common Maintained status, by default require no opt-in that would need the users to discover the new parameter, and at the same time not to break existing git --cc-cmd usage. 
[1] https://lore.kernel.org/all/20221006162413.858527-1-bryan.odonoghue@linaro.org/ [2] https://lore.kernel.org/all/Yzen4X1Na0MKXHs9@mit.edu/ [3] https://lore.kernel.org/all/30776fe75061951777da8fa6618ae89bea7a8ce4.camel@perches.com/ This patch (of 2): The subsystem status is currently reported with --role(stats) by adjusting the maintainer role for any status different from Maintained. This has two downsides: - if a subsystem has only reviewers or mailing lists and no maintainers, the status is not reported (i.e. typically, Orphan subsystems have no maintainers) - the Supported status means that someone is paid for maintaining, but it is reported as "supporter" for all the maintainers, which can be incorrect. People have been also confused about what "supporter" means. This patch introduces a new --substatus option and functionality aimed to report the subsystem status separately, without adjusting the reported maintainer role. After the e-mails are output, the status of subsystems will follow, for example: ... linux-kernel@vger.kernel.org (open list:LIBRARY CODE) LIBRARY CODE status: Supported In order to allow replacing the role rewriting seamlessly, the new option works as follows: - it is automatically enabled when --email and --role are enabled (the defaults include --email and --rolestats which implies --role) - usages with --norolestats e.g. 
for git's --cc-cmd will thus need no adjustments - the most common Maintained status is not reported at all, to reduce unnecessary noise - THE REST catch-all section (contains lkml) status is not reported - the existing --subsystem and --status options are unaffected so their users will need no adjustments [vbabka@suse.cz: require that script output goes to a terminal] Link: https://lkml.kernel.org/r/66c2bf7a-9119-4850-b6b8-ac8f426966e1@suse.cz Link: https://lkml.kernel.org/r/20250203-b4-get_maintainer-v2-0-83ba008b491f@suse.cz Link: https://lkml.kernel.org/r/20250203-b4-get_maintainer-v2-1-83ba008b491f@suse.cz Fixes: c1565b6f7b53 ("get_maintainer: add --substatus for reporting subsystem status") Closes: https://lore.kernel.org/all/7aodxv46lj6rthjo4i5zhhx2lybrhb4uknpej2dyz3e7im5w3w@w23bz6fx3jnn/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Cc: Bryan O'Donoghue <bryan.odonoghue@linaro.org> Cc: Joe Perches <joe@perches.com> Cc: Kees Cook <kees@kernel.org> Cc: Ted Ts'o <tytso@mit.edu> Cc: Thorsten Leemhuis <linux@leemhuis.info> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	powerpc/crash: use generic crashkernel reservation	Sourabh Jain
	Commit 0ab97169aa05 ("crash_core: add generic function to do reservation") added a generic function to reserve crashkernel memory. So let's use the same function on powerpc and remove the architecture-specific code that essentially does the same thing. The generic crashkernel reservation also provides a way to split the crashkernel reservation into high and low memory reservations, which can be enabled for powerpc in the future. Along with moving to the generic crashkernel reservation, the code related to finding the base address for the crashkernel has been separated into its own function name get_crash_base() for better readability and maintainability. Link: https://lkml.kernel.org/r/20250131113830.925179-8-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Baoquan he <bhe@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	powerpc: insert System RAM resource to prevent crashkernel conflict	Sourabh Jain
	The next patch in the series with title "powerpc/crash: use generic crashkernel reservation" enables powerpc to use generic crashkernel reservation instead of custom implementation. This leads to exporting of `Crash Kernel` memory to iomem_resource (/proc/iomem) via insert_crashkernel_resources():kernel/crash_reserve.c or at another place in the same file if HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY is set. The add_system_ram_resources():arch/powerpc/mm/mem.c adds `System RAM` to iomem_resource using request_resource(). This creates a conflict with `Crash Kernel`, which is added by the generic crashkernel reservation code. As a result, the kernel ultimately fails to add `System RAM` to iomem_resource. Consequently, it does not appear in /proc/iomem. Multiple approaches were tried to avoid this: 1. Don't add Crash Kernel to iomem_resource: - This has two issues. First, it requires adding an architecture-specific hook in the generic code. There are already two code paths to choose when to add `Crash Kernel` to iomem_resource. This adds one more code path to skip it. Second, what if `Crash Kernel` is required in /proc/iomem in the future? Many architectures do export it. 2. Don't add `System RAM` to iomem_resource by reverting commit c40dd2f766440 ("powerpc: Add System RAM to /proc/iomem"): - It's not ideal to export `System RAM` via /proc/iomem, but since it was already done earlier and userspace tools like kdump and kdump-utils rely on `System RAM` from /proc/iomem, removing it will break userspace. 3. Add Crash Kernel along with System RAM to /proc/iomem: This patch takes the third approach by updating add_system_ram_resources() to use insert_resource() instead of the request_resource() API to add the `System RAM` resource to iomem_resource. insert_resource() allows inserting resources even if they overlap with existing ones. Since `Crash Kernel` and `System RAM` resources are added to iomem_resource early in the boot, any other conflict is not expected. 
With the changes introduced here and in the next patch, "powerpc/crash: use generic crashkernel reservation," /proc/iomem now exports `System RAM` and `Crash Kernel` as shown below: $ cat /proc/iomem 00000000-3ffffffff : System RAM 10000000-4fffffff : Crash kernel The kdump script is capable of handling `System RAM` and `Crash Kernel` in the above format. The same format is used in other architectures. Link: https://lkml.kernel.org/r/20250131113830.925179-7-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Baoquan he <bhe@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
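The difference between the two APIs named in the commit above can be illustrated with a deliberately simplified userspace model. Everything here is an assumption made for illustration: `struct demo_range`, `demo_request()` and `demo_insert()` are hypothetical names, only a flat list of sibling ranges is modelled, and the kernel's request_resource() and insert_resource() operate on a full resource tree with more cases than this sketch covers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inclusive address range; NOT the kernel's struct resource. */
struct demo_range {
	uint64_t start;
	uint64_t end;
};

static bool demo_overlaps(struct demo_range a, struct demo_range b)
{
	return a.start <= b.end && b.start <= a.end;
}

static bool demo_contains(struct demo_range outer, struct demo_range inner)
{
	return outer.start <= inner.start && inner.end <= outer.end;
}

/* Simplified request model: any overlap with an existing sibling is a conflict. */
bool demo_request(const struct demo_range *sib, size_t n, struct demo_range new_res)
{
	for (size_t i = 0; i < n; i++)
		if (demo_overlaps(sib[i], new_res))
			return false;
	return true;
}

/* Simplified insert model: overlap is accepted when one range nests in the other. */
bool demo_insert(const struct demo_range *sib, size_t n, struct demo_range new_res)
{
	for (size_t i = 0; i < n; i++) {
		if (!demo_overlaps(sib[i], new_res))
			continue;
		if (!demo_contains(new_res, sib[i]) && !demo_contains(sib[i], new_res))
			return false;	/* a partial overlap cannot be nested */
	}
	return true;
}
```

Using the ranges from the /proc/iomem example in the commit message, a `Crash kernel` range of 10000000-4fffffff that is already registered makes the request-style check reject a `System RAM` range of 00000000-3ffffffff, while the insert-style check accepts it because the crash kernel range nests entirely inside it.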
2025-03-16	powerpc/crash: preserve user-specified memory limit	Sourabh Jain
	Commit 59d58189f3d9 ("crash: fix crash memory reserve exceed system memory bug") fails crashkernel parsing if the crash size is found to be higher than system RAM, which makes the memory_limit adjustment code ineffective due to an early exit from reserve_crashkernel(). Regardless, let's not violate the user-specified memory limit by adjusting it. Remove this adjustment to ensure all reservations stay within the limit. Commit f94f5ac07983 ("powerpc/fadump: Don't update the user-specified memory limit") did the same for fadump. Link: https://lkml.kernel.org/r/20250131113830.925179-6-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Baoquan he <bhe@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	powerpc/crash: use generic APIs to locate memory hole for kdump	Sourabh Jain
	On PowerPC, the memory reserved for the crashkernel can contain components like RTAS, TCE, OPAL, etc., which should be avoided when loading kexec segments into crashkernel memory. Due to these special components, PowerPC has its own set of APIs to locate holes in the crashkernel memory for loading kexec segments for kdump. However, for loading kexec segments in the kexec case, PowerPC already uses generic APIs to locate holes. The previous patch in this series, titled "crash: Let arch decide usable memory range in reserved area," introduced arch-specific hook to handle such special regions in the crashkernel area. So, switch PowerPC to use the generic APIs to locate memory holes for kdump and remove the redundant PowerPC-specific APIs. Link: https://lkml.kernel.org/r/20250131113830.925179-5-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Baoquan he <bhe@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	crash: let arch decide usable memory range in reserved area	Sourabh Jain
	Although the crashkernel area is reserved, on architectures like PowerPC, it is possible for the crashkernel reserved area to contain components like RTAS, TCE, OPAL, etc. To avoid placing kexec segments over these components, PowerPC has its own set of APIs to locate holes in the crashkernel reserved area. Add an arch hook in the generic locate mem hole APIs so that architectures can handle such special regions in the crashkernel area while locating memory holes for kexec segments using generic APIs. With this, a lot of redundant arch-specific code can be removed, as it performs the exact same job as the generic APIs. To keep the generic and arch-specific changes separate, the changes related to moving PowerPC to use the generic APIs and the removal of PowerPC-specific APIs for memory hole allocation are done in a subsequent patch titled "powerpc/crash: Use generic APIs to locate memory hole for kdump". Link: https://lkml.kernel.org/r/20250131113830.925179-4-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	crash: remove an unused argument from reserve_crashkernel_generic()	Sourabh Jain
	cmdline argument is not used in reserve_crashkernel_generic() so remove it. Correspondingly, all the callers have been updated as well. No functional change intended. Link: https://lkml.kernel.org/r/20250131113830.925179-3-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	kexec: initialize ELF lowest address to ULONG_MAX	Sourabh Jain
	Patch series "powerpc/crash: use generic crashkernel reservation", v3. Commit 0ab97169aa05 ("crash_core: add generic function to do reservation") added a generic function to reserve crashkernel memory. So let's use the same function on powerpc and remove the architecture-specific code that essentially does the same thing. The generic crashkernel reservation also provides a way to split the crashkernel reservation into high and low memory reservations, which can be enabled for powerpc in the future. Additionally move powerpc to use generic APIs to locate memory hole for kexec segments while loading kdump kernel. This patch (of 7): kexec_elf_load() loads an ELF executable and sets the address of the lowest PT_LOAD section to the address held by the lowest_load_addr function argument. To determine the lowest PT_LOAD address, a local variable lowest_addr (type unsigned long) is initialized to UINT_MAX. After loading each PT_LOAD, its address is compared to lowest_addr. If a loaded PT_LOAD address is lower, lowest_addr is updated. However, setting lowest_addr to UINT_MAX won't work when the kernel image is loaded above 4G, as the returned lowest PT_LOAD address would be invalid. This is resolved by initializing lowest_addr to ULONG_MAX instead. This issue was discovered while implementing crashkernel high/low reservation on the PowerPC architecture. Link: https://lkml.kernel.org/r/20250131113830.925179-1-sourabhjain@linux.ibm.com Link: https://lkml.kernel.org/r/20250131113830.925179-2-sourabhjain@linux.ibm.com Fixes: a0458284f062 ("powerpc: Add support code for kexec_file_load()") Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
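The failure mode described in the commit above can be reproduced with a small userspace sketch. The function name `demo_lowest_addr()` is hypothetical, and `uint64_t` stands in for the kernel's 64-bit `unsigned long` so that the example behaves the same on any host; it is an illustration of the min-tracking logic, not the kexec_elf_load() code itself.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Track the lowest load address among `n` segments, starting from `init`.
 * uint64_t is used here in place of a 64-bit unsigned long.
 */
uint64_t demo_lowest_addr(const uint64_t *paddr, size_t n, uint64_t init)
{
	uint64_t lowest = init;

	for (size_t i = 0; i < n; i++) {
		/* Only addresses below the current minimum replace it. */
		if (paddr[i] < lowest)
			lowest = paddr[i];
	}
	return lowest;
}
```

If every segment is loaded above 4G and the initial value is UINT32_MAX (the 32-bit maximum, matching UINT_MAX on common platforms), no segment ever compares lower, so the function returns the initial value 0xffffffff instead of a real load address. Starting from UINT64_MAX returns the true minimum.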
2025-03-16	lib/plist.c: add shortcut for plist_requeue()	I Hsin Cheng
	In the operation of plist_requeue(), "node" is deleted from the list before queueing it back to the list again, which involves looping to find the tail of same-prio entries. If "node" is the head of same-prio entries, which means its prio_list is on the priority list, then "node_next" can be retrieved immediately from the next entry of prio_list, instead of looping over nodes on node_list.
	The shortcut implementation can benefit plist_requeue() running the below test, and the test result is shown in the following table. One can observe from the test result that when the number of nodes of same-prio entries is smaller, the probability of hitting the shortcut is bigger, thus the benefit is more significant. It tends to behave almost the same for long same-prio entries, since the probability of taking the shortcut is much smaller.
	-----------------------------------------------------------------------
	| Test size         |    200 |     400 |     600 |     800 |     1000 |
	-----------------------------------------------------------------------
	| new_plist_requeue | 271521 | 1007913 | 2148033 | 4346792 | 12200940 |
	-----------------------------------------------------------------------
	| old_plist_requeue | 301395 | 1105544 | 2488301 | 4632980 | 12217275 |
	-----------------------------------------------------------------------
	The test is done on x86_64 architecture with v6.9 kernel and Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz.
	Test script (executed in kernel module mode):
	int init_module(void)
	{
		unsigned int test_data[test_size];

		/*
		 * Split the list into 10 different priorities; when test_size
		 * is larger, the number of nodes within each priority is larger.
		 */
		for (i = 0; i < ARRAY_SIZE(test_data); i++) {
			test_data[i] = i % 10;
		}

		ktime_t start, end, time_elapsed = 0;

		plist_head_init(&test_head_local);

		for (i = 0; i < ARRAY_SIZE(test_node_local); i++) {
			plist_node_init(test_node_local + i, 0);
			test_node_local[i].prio = test_data[i];
		}

		for (i = 0; i < ARRAY_SIZE(test_node_local); i++) {
			if (plist_node_empty(test_node_local + i)) {
				plist_add(test_node_local + i, &test_head_local);
			}
		}

		for (i = 0; i < ARRAY_SIZE(test_node_local); i += 1) {
			start = ktime_get();
			plist_requeue(test_node_local + i, &test_head_local);
			end = ktime_get();
			time_elapsed += (end - start);
		}

		pr_info("plist_requeue() elapsed time : %lld, size %d\n", time_elapsed, test_size);
		return 0;
	}
	[akpm@linux-foundation.org: tweak comment and code layout] Link: https://lkml.kernel.org/r/20250119062408.77638-1-richard120310@gmail.com Signed-off-by: I Hsin Cheng <richard120310@gmail.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	docs,procfs: document /proc/PID/* access permission checks	Andrii Nakryiko
	Add a paragraph explaining what sort of capabilities a process would need to read procfs data for some other process. Also mention that reading data for its own process doesn't require any extra permissions. Link: https://lkml.kernel.org/r/20250129001747.759990-1-andrii@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	.mailmap: remove redundant mappings of emails	Carlos Bilbao
	Remove two redundant mappings: changbin.du@intel.com -> changbin.du@intel.com viresh.kumar@linaro.org -> viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20250129013430.1117720-1-carlos.bilbao@kernel.org Signed-off-by: Carlos Bilbao <carlos.bilbao@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	scripts: add script to extract built-in firmware blobs	Guilherme G. Piccoli
	There is currently no tool to extract a firmware blob that is built-in on vmlinux to the best of my knowledge. So if we have a kernel image containing the blobs, and we want to rebuild the kernel with some debug patches for example (and given that the image also has IKCONFIG=y), we currently can't do that for the same versions for all the firmware blobs, _unless_ we have exact commits of linux-firmware for the specific versions for each firmware included. Through the options CONFIG_EXTRA_FIRMWARE{_DIR} one is able to build a kernel including firmware blobs in a built-in fashion. This is usually the case of built-in drivers that require some blobs in order to work properly, for example, like in non-initrd based systems. Add hereby a script to extract these blobs from a non-stripped vmlinux, similar to the idea of "extract-ikconfig". The firmware loader interface saves such built-in blobs as rodata entries, having a field for the FW name as "_fw_<module_name>_<firmware_name>_bin"; the tool extracts files named "<module_name>_<firmware_name>" for each rodata firmware entry detected. It makes use of awk, bash, dd and readelf, pretty standard tooling for Linux development. With this tool, we can blindly extract the FWs and easily re-add them in the new debug kernel build, allowing a more deterministic testing without the burden of "hunting down" the proper version of each firmware binary. Link: https://lkml.kernel.org/r/20250120190436.127578-1-gpiccoli@igalia.com Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Reviewed-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Russ Weight <russ.weight@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
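The naming rule described above ("_fw_<module_name>_<firmware_name>_bin" in rodata, "<module_name>_<firmware_name>" on disk) can be sketched in C as follows. This is only an illustration of the string mapping, not the script added by the commit (which uses awk, bash, dd and readelf); the function name `fw_symbol_to_filename` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from the commit: map a built-in firmware
 * symbol name of the form "_fw_<name>_bin" to the output file name "<name>".
 * Returns 0 on success, -1 if the symbol does not match the pattern or the
 * output buffer is too small.
 */
int fw_symbol_to_filename(const char *sym, char *out, size_t outlen)
{
	static const char prefix[] = "_fw_";
	static const char suffix[] = "_bin";
	const size_t plen = sizeof(prefix) - 1;
	const size_t slen = sizeof(suffix) - 1;
	size_t len = strlen(sym);
	size_t namelen;

	/* Need at least one character between prefix and suffix. */
	if (len <= plen + slen)
		return -1;
	if (strncmp(sym, prefix, plen) != 0)
		return -1;
	if (strcmp(sym + len - slen, suffix) != 0)
		return -1;

	namelen = len - plen - slen;
	if (namelen + 1 > outlen)
		return -1;

	memcpy(out, sym + plen, namelen);
	out[namelen] = '\0';
	return 0;
}
```

A caller would pass each candidate symbol name found in the symbol table and skip the ones for which the helper returns -1.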
2025-03-16	MAINTAINERS: add Yang Yang as a co-maintainer of PER-TASK DELAY ACCOUNTING	Yang Yang
	Balbir Singh is the sole maintainer of PER-TASK DELAY ACCOUNTING. He started work on cgroupstats a long time ago, and the subsystem has not been growing at a very rapid pace since. Thanks to their excellent work, delay accounting is still very useful for observing and optimizing system delay, but it still needs continuous improvement. Yang Yang and his team have worked on most of the recent patches to the subsystem, and he is strongly willing to help. Balbir Singh is glad to see that, so add him as a co-maintainer. Link: https://lkml.kernel.org/r/20250117222013817zWHgBaSigRI_eRJt1hqnu@zte.com.cn Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	mm,procfs: allow read-only remote mm access under CAP_PERFMON	Andrii Nakryiko
	It's very common for various tracing and profiling tools to need access to /proc/PID/maps contents for stack symbolization, to learn which shared libraries are mapped in memory, at which file offset, etc. Currently, access to /proc/PID/maps requires CAP_SYS_PTRACE (unless we are looking at data for our own process, which is a trivial case not too relevant for profiler use cases). Unfortunately, CAP_SYS_PTRACE implies way more than just the ability to discover the memory layout of another process: it allows full control of arbitrary other processes. This is problematic from a security POV for applications that only need read-only /proc/PID/maps (and other similar read-only data) access, and in large production settings CAP_SYS_PTRACE is frowned upon even for the system-wide profilers. On the other hand, it's already possible to access similar kind of information (and more) with just CAP_PERFMON capability. E.g., setting up PERF_RECORD_MMAP collection through perf_event_open() would give one similar information to what /proc/PID/maps provides. CAP_PERFMON, together with CAP_BPF, is already a very common combination for system-wide profiling and observability applications. As such, it's reasonable and convenient to be able to access /proc/PID/maps with CAP_PERFMON capabilities instead of CAP_SYS_PTRACE. For procfs, these permissions are checked through common mm_access() helper, and so we augment that with cap_perfmon() check only if requested mode is PTRACE_MODE_READ. I.e., PTRACE_MODE_ATTACH wouldn't be permitted by CAP_PERFMON. So /proc/PID/mem, which uses PTRACE_MODE_ATTACH, won't be permitted by CAP_PERFMON, but /proc/PID/maps, /proc/PID/environ, and a bunch of other read-only contents will be allowable under CAP_PERFMON. Besides procfs itself, mm_access() is used by process_madvise() and process_vm_{readv,writev}() syscalls. 
The former one uses PTRACE_MODE_READ to avoid leaking ASLR metadata, and as such CAP_PERFMON seems like a meaningful allowable capability as well. process_vm_{readv,writev} currently assume PTRACE_MODE_ATTACH level of permissions (though for readv PTRACE_MODE_READ seems more reasonable, but that's outside the scope of this change), and as such won't be affected by this patch. Link: https://lkml.kernel.org/r/20250127222114.1132392-1-andrii@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
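The capability rule described in this message can be modelled in a few lines of user-space C. This is a simplified sketch, not the kernel's mm_access(): it looks only at the capability part of the decision and ignores the own-process case and the credential checks that the real ptrace access check also performs. The enum, struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for PTRACE_MODE_READ and PTRACE_MODE_ATTACH. */
enum access_mode { MODE_READ, MODE_ATTACH };

/* Hypothetical capability set of the calling task. */
struct caller_caps {
	bool sys_ptrace;	/* models CAP_SYS_PTRACE */
	bool perfmon;		/* models CAP_PERFMON */
};

/*
 * CAP_SYS_PTRACE grants both modes; CAP_PERFMON grants only read-only
 * access, so attach-level users such as /proc/PID/mem stay protected.
 */
bool may_access_remote_mm(enum access_mode mode, const struct caller_caps *caps)
{
	if (caps->sys_ptrace)
		return true;
	return mode == MODE_READ && caps->perfmon;
}
```

With only `perfmon` set, the model allows `MODE_READ` (the /proc/PID/maps case) and refuses `MODE_ATTACH` (the /proc/PID/mem case), which is the split the commit message describes.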
2025-03-16	mm/page_alloc: warn on nr_reserved_highatomic underflow	Brendan Jackman
	As documented in the comment this underflow should not happen. The locking has indeed changed here since the comment was written, see the migratetype hygiene patches[0]. However, those changes made the locking _safer_, so the underflow _really_ shouldn't happen now. So upgrade the comment to a warning. [0] https://lore.kernel.org/all/20240320180429.678181-7-hannes@cmpxchg.org/T/#m3da87e6cc3348a4640aa298137bc9f8f61b76c84 Link: https://lkml.kernel.org/r/20250225-warn-underflow-v1-1-3dc542941d3a@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
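The idea of turning a "this should not underflow" comment into a runtime warning can be sketched in user-space C as below. This is not the page allocator code: the function name is hypothetical, the warning is printed with fprintf() instead of a kernel warning, and clamping the counter to zero on underflow is an assumption made for the illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Subtract n from *counter. If that would underflow, emit a warning,
 * clamp the counter to zero (an assumption of this sketch) and return
 * true so the caller can tell that the invariant was violated.
 */
bool counter_sub_warn(unsigned long *counter, unsigned long n)
{
	if (n > *counter) {
		fprintf(stderr, "warning: counter underflow (%lu - %lu)\n",
			*counter, n);
		*counter = 0;
		return true;
	}
	*counter -= n;
	return false;
}
```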
2025-03-16	vmalloc: drop Christoph from Reviewers	Christoph Hellwig
	I haven't been doing as much review as I should. As part of reducing my inbox flow drop me from the official Reviewers. I might still chime in on patches occasionally. Link: https://lkml.kernel.org/r/20250224163033.350072-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	mm, swap: simplify folio swap allocation	Kairui Song
	With slot cache gone, clean up the allocation helpers even more. folio_alloc_swap will be the only entry for allocation and adding the folio to swap cache (except suspend), making it opposite of folio_free_swap. Link: https://lkml.kernel.org/r/20250313165935.63303-8-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	mm, swap: remove swap slot cache	Kairui Song
	Slot cache is no longer needed now, so remove it and all related code.

	- vm-scalability with: `usemem --init-time -O -y -x -R -31 1G`, 12G memory cgroup using simulated pmem as SWAP (32G pmem, 32 CPUs), 16 test runs for each case, measuring the total throughput:

	                        Before (KB/s) (stdev)    After (KB/s) (stdev)
	Random (4K):            424907.60 (24410.78)     414745.92 (34554.78)
	Random (64K):           163308.82 (11635.72)     167314.50 (18434.99)
	Sequential (4K, !-R):   6150056.79 (103205.90)   6321469.06 (115878.16)

	The performance changes are below noise level.

	- Build linux kernel with make -j96, using 4K folio with 1.5G memory cgroup limit and 64K folio with 2G memory cgroup limit, on top of tmpfs, 12 test runs, measuring the system time:

	                    Before (s) (stdev)   After (s) (stdev)
	make -j96 (4K):     6445.69 (61.95)      6408.80 (69.46)
	make -j96 (64K):    6841.71 (409.04)     6437.99 (435.55)

	Similar to above, 64k mTHP case showed a slight improvement.

	Link: https://lkml.kernel.org/r/20250313165935.63303-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	mm, swap: use percpu cluster as allocation fast path	Kairui Song
	Current allocation workflow first traverses the plist with a global lock held, after choosing a device, it uses the percpu cluster on that swap device. This commit moves the percpu cluster variable out of being tied to individual swap devices, making it a global percpu variable, and will be used directly for allocation as a fast path. The global percpu cluster variable will never point to a HDD device, and allocations on a HDD device are still globally serialized. This improves the allocator performance and prepares for removal of the slot cache in later commits. There shouldn't be much observable behavior change, except one thing: this changes how swap device allocation rotation works. Currently, each allocation will rotate the plist, and because of the existence of slot cache (one order 0 allocation usually returns 64 entries), swap devices of the same priority are rotated for every 64 order 0 entries consumed. High order allocations are different, they will bypass the slot cache, and so swap device is rotated for every 16K, 32K, or up to 2M allocation. The rotation rule was never clearly defined or documented, and it was changed several times without mention. After this commit, and once slot cache is gone in later commits, swap device rotation will happen for every consumed cluster. Ideally non-HDD devices will be rotated if 2M space has been consumed for each order. Fragmented clusters will rotate the device faster, which seems OK. HDD devices are rotated for every allocation regardless of the allocation order, which should be OK too and trivial. This commit also slightly changes allocation behaviour for slot cache. The newly added cluster allocation fast path may allocate entries from a different device to the slot cache. This is not observable from user space and only impacts performance very slightly, and slot cache will be gone in the next commit, so this can be ignored. 
Link: https://lkml.kernel.org/r/20250313165935.63303-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	mm, swap: don't update the counter up-front	Kairui Song
	The counter update before allocation design was useful to avoid unnecessary scan when device is full, so it will abort early if the counter indicates the device is full. But that is an uncommon case, and now scanning of a full device is very fast, so the up-front update is not helpful any more. Remove it and simplify the slot allocation logic. Link: https://lkml.kernel.org/r/20250313165935.63303-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	mm, swap: avoid redundant swap device pinning	Kairui Song
	Currently __read_swap_cache_async() has get/put_swap_device() calls to increase/decrease a swap device reference to prevent swapoff. However, some of its callers already hold the swap device reference, e.g. do_swap_page() and shmem_swapin_folio(), from which __read_swap_cache_async() is eventually called. Now there are only two callers not holding a swap device reference, so make them hold a reference instead, and drop the get/put_swap_device() calls in __read_swap_cache_async(). This should slightly reduce the overhead of swap-in during page faults. Link: https://lkml.kernel.org/r/20250313165935.63303-4-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	mm, swap: drop the flag TTRS_DIRECT	Kairui Song
	This flag exists temporarily to allow the allocator to bypass the slot cache during freeing, so reclaiming one slot will free the slot immediately. But now we have already removed slot cache usage on freeing, so this flag has no effect now. Link: https://lkml.kernel.org/r/20250313165935.63303-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	mm, swap: avoid reclaiming irrelevant swap cache	Kairui Song
	Patch series "mm, swap: remove swap slot cache", v3. Slot cache was initially introduced by commit 67afa38e012e ("mm/swap: add cache for swap slots allocation") to reduce the lock contention of si->lock. Previous series "mm, swap: rework of swap allocator locks" [1] removed swap slot cache for freeing path as freeing path no longer touches si->lock in most cases. Allocation path also has little to no contention on si->lock since that series, but slot cache still helps to reduce other overheads, like counters and the plist. This series removes the slot cache from allocation path too, by using the cluster as allocation fast path and also reducing other overheads. Now slot cache is completely gone, the code is much simplified without obvious feature or performance change, and related workarounds are cleaned up. Also this should avoid other potential issues, e.g. the long pinning of swap slots: swap slot cache pins swap slots with HAS_CACHE, causing reclaim or allocation to fail to use these slots on scanning. The only behavior change is the swap device allocation rotation mechanism, as explained in the patch "mm, swap: use percpu cluster as allocation fast path". 
Test results are looking good after deleting the swap slot cache:

	- vm-scalability with: `usemem --init-time -O -y -x -R -31 1G`, 12G memory cgroup using simulated pmem as SWAP (32G pmem, 32 CPUs), 16 test runs for each case, measuring the total throughput:

	                        Before (KB/s) (stdev)    After (KB/s) (stdev)
	Random (4K):            424907.60 (24410.78)     414745.92 (34554.78)
	Random (64K):           163308.82 (11635.72)     167314.50 (18434.99)
	Sequential (4K, !-R):   6150056.79 (103205.90)   6321469.06 (115878.16)

	- Build linux kernel with make -j96, using 4K folio with 1.5G memory cgroup limit and 64K folio with 2G memory cgroup limit, on top of tmpfs, 12 test runs, measuring the system time:

	                    Before (s) (stdev)   After (s) (stdev)
	make -j96 (4K):     6445.69 (61.95)      6408.80 (69.46)
	make -j96 (64K):    6841.71 (409.04)     6437.99 (435.55)

	The performance is unchanged, slightly better in some cases.

	[1] https://lore.kernel.org/linux-mm/20250113175732.48099-1-ryncsn@gmail.com/

	This patch (of 7): Swap allocator will do swap cache reclaim to recycle HAS_CACHE slots for allocation. It initiates the reclaim from the offset to be reclaimed and looks up the corresponding folio. The lookup process is lockless, so it's possible the folio will be removed from the swap cache and given a different swap entry before the reclaim locks the folio. If that happens, the reclaim will end up reclaiming an irrelevant folio and return a wrong value. This shouldn't cause any problem with correctness or stability, but it is indeed confusing and unexpected, and will increase fragmentation and decrease performance. Fix this by checking whether the folio is still pointing to the offset the allocator wants to reclaim before reclaiming it. 
Link: https://lkml.kernel.org/r/20250313165935.63303-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250313165935.63303-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
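The recheck described in this message (after the lockless lookup, confirm that the folio still points at the offset being reclaimed) can be illustrated with a small user-space model. The struct and function names are hypothetical, and treating a folio as covering a contiguous range of `nr_pages` slots is an assumption of this sketch rather than something stated in the commit message.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a folio in the swap cache. */
struct folio_model {
	unsigned long swap_offset;	/* first swap slot the folio occupies */
	unsigned long nr_pages;		/* number of consecutive slots covered */
};

/*
 * To be called once the folio is locked: returns true only if the folio
 * still covers the offset the allocator wanted to reclaim. If the folio
 * was given a different swap entry in the meantime, the caller should
 * skip it instead of reclaiming an irrelevant folio.
 */
bool folio_still_covers_offset(const struct folio_model *folio,
			       unsigned long offset)
{
	return offset >= folio->swap_offset &&
	       offset - folio->swap_offset < folio->nr_pages;
}
```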
2025-03-16	mm: make page_mapped_in_vma() hugetlb walk aware	Jane Chu
	When a process consumes a UE in a page, the memory failure handler attempts to collect information for a potential SIGBUS. If the page is an anonymous page, page_mapped_in_vma(page, vma) is invoked in order to 1. retrieve the vaddr from the process' address space, 2. verify that the vaddr is indeed mapped to the poisoned page, where 'page' is the precise small page with UE. It's been observed that when injecting poison to a non-head subpage of an anonymous hugetlb page, no SIGBUS shows up, while injecting to the head page produces a SIGBUS. The cause is that, although hugetlb_walk() returns a valid pmd entry (on x86), check_pte() detects a mismatch between the head page per the pmd and the input subpage. Thus the vaddr is considered not mapped to the subpage and the process is not collected for SIGBUS purposes.

	This is the calling stack:

	  collect_procs_anon
	    page_mapped_in_vma
	      page_vma_mapped_walk
	        hugetlb_walk
	        huge_pte_lock
	        check_pte

	The check_pte() header says that it "check if [pvmw->pfn, @pvmw->pfn + @pvmw->nr_pages) is mapped at the @pvmw->pte", but in practice it works only if pvmw->pfn is the head page pfn at pvmw->pte. In hindsight, some pvmw->pte could point to a hugepage of some sort, so it makes sense to make check_pte() work for hugepages.

	Link: https://lkml.kernel.org/r/20250224211445.2663312-1-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	mm: page_alloc: group fallback functions together	Johannes Weiner
	The way the fallback rules are spread out makes them hard to follow. Move the functions next to each other at least. Link: https://lkml.kernel.org/r/20250225001023.1494422-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	mm: page_alloc: remove remnants of unlocked migratetype updates	Johannes Weiner
	The freelist hygiene patches made migratetype accesses fully protected under the zone->lock. Remove remnants of handling the race conditions that existed before from the MIGRATE_HIGHATOMIC code. Link: https://lkml.kernel.org/r/20250225001023.1494422-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	mm: page_alloc: don't steal single pages from biggest buddy	Johannes Weiner
	The fallback code searches for the biggest buddy first in an attempt to steal the whole block and encourage type grouping down the line. The approach used to be this:

	- Non-movable requests will split the largest buddy and steal the remainder. This splits up contiguity, but it allows subsequent requests of this type to fall back into adjacent space.

	- Movable requests go and look for the smallest buddy instead. The thinking is that movable requests can be compacted, so grouping is less important than retaining contiguity.

	c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") enforces freelist type hygiene, which restricts stealing to either claiming the whole block or just taking the requested chunk; no additional pages or buddy remainders can be stolen any more. The patch mishandled when to switch to finding the smallest buddy in that new reality. As a result, it may steal the exact request size, but from the biggest buddy. This causes fracturing for no good reason. Fix this by committing to the new behavior: either steal the whole block, or fall back to the smallest buddy. Remove single-page stealing from steal_suitable_fallback(). Rename it to try_to_steal_block() to make the intentions clear. If this fails, always fall back to the smallest buddy. The following is from 4 runs of mmtest's thpchallenge. "Pollute" is single page fallback, "steal" is conversion of a partially used block. The numbers for free block conversions (omitted) are comparable. 

	                                          vanilla  patched
	@pollute[unmovable from reclaimable]:          27      106
	@pollute[unmovable from movable]:              82       46
	@pollute[reclaimable from unmovable]:         256       83
	@pollute[reclaimable from movable]:            46        8
	@pollute[movable from unmovable]:            4841      868
	@pollute[movable from reclaimable]:          5278    12568
	@steal[unmovable from reclaimable]:            11       12
	@steal[unmovable from movable]:               113       49
	@steal[reclaimable from unmovable]:            19       34
	@steal[reclaimable from movable]:              47       21
	@steal[movable from unmovable]:               250      183
	@steal[movable from reclaimable]:              81       93

	The allocator appears to do a better job at keeping stealing and polluting to the first fallback preference. As a result, the numbers for "from movable" - the least preferred fallback option, and most detrimental to compactability - are down across the board. Link: https://lkml.kernel.org/r/20250225001023.1494422-2-hannes@cmpxchg.org Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	tools/selftests: add guard region test for /proc/$pid/pagemap	Lorenzo Stoakes
	Add a test to the guard region self tests to assert that the /proc/$pid/pagemap information now made available to the user correctly identifies and reports guard regions. As a part of this change, update vm_util.h to add the new bit (note there is no header file in the kernel where this is exposed, the user is expected to provide their own mask) and utilise the helper functions there for pagemap functionality. [lorenzo.stoakes@oracle.com: fixup define name] Link: https://lkml.kernel.org/r/32e83941-e6f5-42ee-9292-a44c16463cf1@lucifer.local Link: https://lkml.kernel.org/r/164feb0a43ae72650e6b20c3910213f469566311.1740139449.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	fs/proc/task_mmu: add guard region bit to pagemap	Lorenzo Stoakes
	Patch series "fs/proc/task_mmu: add guard region bit to pagemap". Currently there is no means of determining whether a given page in a mapping range is designated a guard region (as installed via madvise() using the MADV_GUARD_INSTALL flag). This is generally not an issue, but in some instances users may wish to determine whether this is the case. This series adds this ability via /proc/$pid/pagemap, updates the documentation and adds a self test to assert that this functions correctly. This patch (of 2): Currently there is no means by which users can determine whether a given page in memory is in fact a guard region, that is having had the MADV_GUARD_INSTALL madvise() flag applied to it. This is intentional, as to provide this information in VMA metadata would contradict the intent of the feature (providing a means to change fault behaviour at a page table level rather than a VMA level), and would require VMA metadata operations to scan page tables, which is unacceptable. In many cases, users have no need to reflect and determine what regions have been designated guard regions, as it is the user who has established them in the first place. But in some instances, such as monitoring software, or software that relies upon being able to ascertain the nature of mappings within a remote process for instance, it becomes useful to be able to determine which pages have the guard region marker applied. This patch makes use of an unused pagemap bit (58) to provide this information. This patch updates the documentation at the same time as making the change such that the implementation of the feature and the documentation of it are tied together. 
Link: https://lkml.kernel.org/r/cover.1740139449.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/521d99c08b975fb06a1e7201e971cc24d68196d1.1740139449.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
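Since no kernel header exposes the new bit, a user-space consumer has to define its own mask, roughly as in the sketch below. Bit 58 comes from the commit message; the macro and function names are hypothetical, and the offset helper relies on /proc/$pid/pagemap holding one 64-bit entry per virtual page.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit number taken from the commit message; the macro name is made up. */
#define PM_GUARD_REGION_BIT 58

/* True if a 64-bit pagemap entry has the guard region bit set. */
bool pagemap_entry_is_guard(uint64_t entry)
{
	return (entry >> PM_GUARD_REGION_BIT) & 1;
}

/*
 * Byte offset within /proc/$pid/pagemap of the entry describing the page
 * that contains vaddr, given the system page size.
 */
uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	return (vaddr / page_size) * sizeof(uint64_t);
}
```

A monitoring tool would read 8 bytes from the pagemap file at `pagemap_offset(vaddr, page_size)` with pread() and pass the value to `pagemap_entry_is_guard()`.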
2025-03-16	mm: swap: remove stale comment of swap_reclaim_full_clusters()	Kemeng Shi
	swap_reclaim_full_clusters() has no return value now, just remove the stale comment which says swap_reclaim_full_clusters() will return a bool value. Link: https://lkml.kernel.org/r/20250222160850.505274-7-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	mm, swap: correct comment in swap_usage_sub()	Kemeng Shi
	We will add si back to plist in swap_usage_sub(), just correct the wrong comment which says we will remove si from plist in swap_usage_sub(). Link: https://lkml.kernel.org/r/20250222160850.505274-6-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	mm, swap: remove setting SWAP_MAP_BAD for discard cluster	Kemeng Shi
	Before allocating from a cluster, we will acquire the cluster's lock and make sure it is usable by cluster_is_usable(), so there is no need to set SWAP_MAP_BAD for a cluster to be discarded. Link: https://lkml.kernel.org/r/20250222160850.505274-5-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	selftests/mm/mlock: print error on failure	Brendan Jackman
	It's not really possible to start diagnosing this without knowing the actual error. Also update the mlock2 helper to behave like libc would by setting errno and returning -1. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-12-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
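The libc convention mentioned here (set errno and return -1 on failure) can be shown with a small generic helper. This is not the selftest's mlock2 helper: the function name is hypothetical, and the sketch assumes the raw value follows the kernel convention of returning a negative errno code on failure.

```c
#include <errno.h>

/*
 * Convert a kernel-style result (>= 0 on success, -errno on failure, which
 * is an assumption of this sketch) into the libc convention: on failure,
 * store the error code in errno and return -1.
 */
long to_libc_result(long raw)
{
	if (raw < 0) {
		errno = (int)-raw;
		return -1;
	}
	return raw;
}
```

With this convention a test can report the failure reason with perror() or strerror(errno) instead of printing only a bare return value.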
2025-03-16	selftests/mm: skip mlock tests if nobody user can't read it	Brendan Jackman
	If running from a directory that can't be read by unprivileged users, executing on-fault-test via the nobody user will fail. The kselftest build does give the file the correct permissions, but after being installed it might be in a directory without global execute permissions. Since the script can't safely fix that, just skip if it happens. Note that the stderr of the `ls` command is unfiltered meaning the user sees a "permission denied" error that can help inform them why the test was skipped. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-11-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	selftests/mm: ensure uffd-wp-mremap gets pages of each size	Brendan Jackman
	This test allocates a page of every available size and doesn't have any SKIP logic if the allocation fails. So, ensure it's available and skip the test if we can't do so. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-10-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16	selftests/mm: drop unnecessary sudo usage	Brendan Jackman
	This script must be run as root anyway (see all the writing to privileged files in /proc etc). Remove the unnecessary use of sudo to avoid breaking on single-user systems that don't have sudo. This also avoids confusing readers. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-9-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>