summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2024-09-02gpiolib: legacy: Kill GPIOF_INIT_* definitionsAndy Shevchenko
Besides the fact that (old) drivers use wrong definitions, e.g., GPIOF_INIT_HIGH instead of GPIOF_OUT_INIT_HIGH, shrink the legacy definitions by killing those GPIOF_INIT_* completely. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Alexander Sverdlin <alexander.sverdlin@gmail.com> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@gmail.com> Link: https://lore.kernel.org/r/20240828142554.2424189-2-andriy.shevchenko@linux.intel.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2024-09-02Merge tag 'memory-controller-drv-6.12' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl into soc/drivers Memory controller drivers for v6.12 1. Tegra210 EMC: Driver refactoring and rework. 2. Tegra186 EMC: Drop unused function. 3. FSL WEIM: Correct fsl,weim-cs-timing property to properly validate it as an array. 4. TI AEMIF: Drop platform data support. 5. TI EMIF: Switch to of_property_read_bool(). 6. Several cleanups in multiple drivers: TI AEMIF and EMIF, Tegra EMC/MC, Atmel EBI, Samsung Exynos5422 DMC, STM32 FMC2 EBI, OMAP GPMC, PL172 and PL1353 SMC. These are mostly code simplifying around probe() like using - devm_clk_get_enabled(), - dev_err_probe(), - scoped device node handling (cleanup.h), - scoped for each OF child loops, - scoped/guard locks. * tag 'memory-controller-drv-6.12' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl: (35 commits) memory: mtk-smi: Use devm_clk_get_enabled() memory: pl353-smc: simplify with devm_clk_get_enabled() memory: pl353-smc: simplify with dev_err_probe() memory: pl172: simplify with devm_clk_get_enabled() memory: pl172: simplify with dev_err_probe() memory: omap-gpmc: simplify locking with guard() memory: emif: simplify locking with guard() memory: emif: drop unused 'irq_state' member memory: ti-aemif: Revert "memory: ti-aemif: don't needlessly iterate over child nodes" memory: ti-aemif: simplify with scoped for each OF child loop memory: ti-aemif: simplify with dev_err_probe() memory: tegra30-emc: simplify with scoped for each OF child loop memory: tegra20-emc: simplify with scoped for each OF child loop memory: tegra124-emc: simplify with scoped for each OF child loop memory: tegra-mc: simplify with scoped for each OF child loop memory: stm32-fmc2-ebi: simplify with dev_err_probe() memory: stm32-fmc2-ebi: simplify with scoped for each OF child loop memory: samsung: exynos5422-dmc: use scoped device node handling to simplify error paths memory: samsung: exynos5422-dmc: simplify dmc->dev usage memory: atmel-ebi: simplify with scoped for each OF child loop ... Link: https://lore.kernel.org/r/20240827122926.30794-1-krzysztof.kozlowski@linaro.org Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-09-02dt-bindings: clock: renesas: Document RZ/V2H(P) SoC CPGLad Prabhakar
Document the device tree bindings for the Renesas RZ/V2H(P) SoC Clock Pulse Generator (CPG). CPG block handles the below operations: - Generation and control of clock signals for the IP modules - Generation and control of resets - Control over booting - Low power consumption and power supply domains Also define constants for the core clocks of the RZ/V2H(P) SoC. Note the core clocks are a subset of the ones which are listed as part of section 4.4.2 of HW manual Rev.1.01 which cannot be controlled by CLKON register. Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/20240729202645.263525-2-prabhakar.mahadev-lad.rj@bp.renesas.com Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
2024-09-02gpio: ath79: remove support for platform dataBartosz Golaszewski
There are no more board files defining platform data for this driver so remove the header and support from the driver. Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20240821121456.19553-4-brgl@bgdev.pl Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2024-09-02RDMA/bnxt_re: Share a page to expose per SRQ info with userspaceChandramohan Akula
Gen P7 adapters needs to share a toggle bits information received in kernel driver with the user space. User space needs this info to arm the SRQ. User space application can get this page using the UAPI routines. Library will mmap this page and get the toggle bits to be used in the next ARM Doorbell. Uses a hash list to map the SRQ structure from the SRQ ID. SRQ structure is retrieved from the hash list while the library calls the UAPI routine to get the toggle page mapping. Currently the full page is mapped per SRQ. This can be optimized to enable multiple SRQs from the same application share the same page and different offsets in the page Signed-off-by: Chandramohan Akula <chandramohan.akula@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1724945645-14989-4-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-01lib/string_choices: add str_true_false()/str_false_true() helperHongbo Li
Add str_true_false()/str_false_true() helper to retur a "true" or "false" string literal. We found more than 10 cases currently exist in the tree. So these helpers can be used for these cases. This patch (of 3): Add str_true_false()/str_false_true() helper to return "true" or "false" string literal. Link: https://lkml.kernel.org/r/20240827024517.914100-1-lihongbo22@huawei.com Link: https://lkml.kernel.org/r/20240827024517.914100-2-lihongbo22@huawei.com Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Cc: Andy Shevchenko <andy@kernel.org> Cc: Anna Schumaker <anna@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Trond Myklebust <trondmy@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01ratelimit: convert flags to int to save 8 bytes in sizeMateusz Guzik
Only bit 1 is used, making an unsigned long a total overkill. This brings it from 40 to 32 bytes, which in turn shrinks user_struct from 136 to 128 bytes. Since the latter is allocated with hwalign, this means the total usage goes down from 192 to 128 bytes per object. No functional changes. Link: https://lkml.kernel.org/r/20240817123754.240924-1-mjguzik@gmail.com Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01fault-inject: improve build for CONFIG_FAULT_INJECTION=nJani Nikula
The fault-inject.h users across the kernel need to add a lot of #ifdef CONFIG_FAULT_INJECTION to cater for shortcomings in the header. Make fault-inject.h self-contained for CONFIG_FAULT_INJECTION=n, and add stubs for DECLARE_FAULT_ATTR(), setup_fault_attr(), should_fail_ex(), and should_fail() to allow removal of conditional compilation. [akpm@linux-foundation.org: repair fallout from no longer including debugfs.h into fault-inject.h] [akpm@linux-foundation.org: fix drivers/misc/xilinx_tmr_inject.c] [akpm@linux-foundation.org: Add debugfs.h inclusion to more files, per Stephen] Link: https://lkml.kernel.org/r/20240813121237.2382534-1-jani.nikula@intel.com Fixes: 6ff1cb355e62 ("[PATCH] fault-injection capabilities infrastructure") Signed-off-by: Jani Nikula <jani.nikula@intel.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Rob Clark <robdclark@gmail.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01lib/lru_cache: fix spelling mistake "colision"->"collision"Deshan Zhang
There is a spelling mistake in a literal string and in cariable names. Fix these. Link: https://lkml.kernel.org/r/20240725093044.1742842-1-deshan@nfschina.com Signed-off-by: Deshan Zhang <deshan@nfschina.com> Cc: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01xz: remove XZ_EXTERN and extern from functionsLasse Collin
XZ_EXTERN was used to make internal functions static in the preboot code. However, in other decompressors this hasn't been done. On x86-64, this makes no difference to the kernel image size. Omit XZ_EXTERN and let some of the internal functions be extern in the preboot code. Omitting XZ_EXTERN from include/linux/xz.h fixes warnings in "make htmldocs" and makes the intradocument links to xz_dec functions work in Documentation/staging/xz.rst. The alternative would have been to add "XZ_EXTERN" to c_id_attributes in Documentation/conf.py but omitting XZ_EXTERN seemed cleaner. Link: https://lore.kernel.org/lkml/20240723205437.3c0664b0@kaneli/ Link: https://lkml.kernel.org/r/20240724110544.16430-1-lasse.collin@tukaani.org Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Jonathan Corbet <corbet@lwn.net> Cc: Sam James <sam@gentoo.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Emil Renner Berthing <emil.renner.berthing@canonical.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Joel Stanley <joel@jms.id.au> Cc: Jubin Zhong <zhongjubin@huawei.com> Cc: Jules Maselbas <jmaselbas@zdiv.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Rui Li <me@lirui.org> Cc: Simon Glass <sjg@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01xz: improve the MicroLZMA kernel-doc in xz.hLasse Collin
Move the description of the format into a "DOC:" comment. Emphasize that MicroLZMA functions aren't usually needed. Link: https://lkml.kernel.org/r/20240721133633.47721-7-lasse.collin@tukaani.org Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Reviewed-by: Sam James <sam@gentoo.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Emil Renner Berthing <emil.renner.berthing@canonical.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Joel Stanley <joel@jms.id.au> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jubin Zhong <zhongjubin@huawei.com> Cc: Jules Maselbas <jmaselbas@zdiv.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Rui Li <me@lirui.org> Cc: Simon Glass <sjg@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01xz: fix kernel-doc formatting errors in xz.hLasse Collin
The opaque structs xz_dec and xz_dec_microlzma are declared in xz.h but their definitions are in xz_dec_lzma2.c without kernel-doc comments. Use regular comments for these structs in xz.h to avoid errors when building the docs. Add a few missing colons. Link: https://lkml.kernel.org/r/20240721133633.47721-6-lasse.collin@tukaani.org Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Reviewed-by: Sam James <sam@gentoo.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Emil Renner Berthing <emil.renner.berthing@canonical.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Joel Stanley <joel@jms.id.au> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jubin Zhong <zhongjubin@huawei.com> Cc: Jules Maselbas <jmaselbas@zdiv.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Rui Li <me@lirui.org> Cc: Simon Glass <sjg@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01xz: switch from public domain to BSD Zero Clause License (0BSD)Lasse Collin
Remove the public domain notices and add SPDX license identifiers. Change MODULE_LICENSE from "GPL" to "Dual BSD/GPL" because 0BSD should count as a BSD license variant here. The switch to 0BSD was done in the upstream XZ Embedded project because public domain has (real or perceived) legal issues in some jurisdictions. Link: https://lkml.kernel.org/r/20240721133633.47721-4-lasse.collin@tukaani.org Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Reviewed-by: Sam James <sam@gentoo.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Emil Renner Berthing <emil.renner.berthing@canonical.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Joel Stanley <joel@jms.id.au> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jubin Zhong <zhongjubin@huawei.com> Cc: Jules Maselbas <jmaselbas@zdiv.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Rui Li <me@lirui.org> Cc: Simon Glass <sjg@chromium.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01memcg: allocate v1 event percpu only on v1 deploymentShakeel Butt
Currently memcg->events_percpu gets allocated on v2 deployments. Let's move the allocation to v1 only codebase. This is not needed in v2. Link: https://lkml.kernel.org/r/20240815050453.1298138-7-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01memcg: move v1 only percpu stats in separate structShakeel Butt
Patch series "memcg: further decouple v1 code from v2". Some of the v1 code is still in v2 code base due to v1 fields in the struct memcg_vmstats_percpu. This field decouples those fileds from v2 struct and move all the related code into v1 only code base. This patch (of 7): At the moment struct memcg_vmstats_percpu contains two v1 only fields which consumes memory even when CONFIG_MEMCG_V1 is not enabled. In addition there are v1 only functions accessing them and are in the main memcontrol source file and can not be moved to v1 only source file due to these fields. Let's move these fields into their own struct. Later patches will move the functions accessing them to v1 source file and only allocate these fields when CONFIG_MEMCG_V1 is enabled. Link: https://lkml.kernel.org/r/20240815050453.1298138-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20240815050453.1298138-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01maple_tree: introduce store_type enumSidhartha Kumar
Patch series "Introduce a store type enum for the Maple tree", v4. ================================ OVERVIEW ================================ This series implements two work items[3]: "aligning mas_store_gfp() with mas_preallocate()" and "enum for store type". mas_store_gfp() is modified to preallocate nodes. This simplies many of the write helper functions by allowing them to use mas_store_gfp() rather than open coding node allocation and error handling. The enum defines the following store types: enum store_type { wr_invalid, wr_new_root, wr_store_root, wr_exact_fit, wr_spanning_store, wr_split_store, wr_rebalance, wr_append, wr_node_store, wr_slot_store, }; In the current maple tree code, a walk down the tree is done in mas_preallocate() to determine the number of nodes needed for this write. After node allocation, mas_wr_store_entry() will perform another walk to determine which write helper function to use to complete the write. Rather than performing the second walk, we can store the type of write in the maple write state during node allocation and read this field to complete the write. Patches 1-16 implement this store type feature. Patch 17 is a cleanup patch to change functions that have unused return types to be void. ================================ RESULTS ================================= Phoronix t-test-1 (Seconds < Lower Is Better) v6.10-rc6 Threads: 1 33.15 Threads: 2 10.81 v6.10-rc6 + this series Threads: 1 32.69 Threads: 2 10.45 Stress-ng mmap 6.10_base store_type_v4 Duration User 2744.65 2769.40 Duration System 10862.69 10817.59 Duration Elapsed 1477.58 1478.35 ================================ TESTING ================================= Testing was done with the maple tree test suite. A new test case is also added to validate the order in which we test for and assign the store type. [1]: https://lore.kernel.org/linux-mm/80926b22-a8d2-9992-eb5e-27e2c99cf460@google.com/T/#m81044feb66765265f8ca7f21e4b4b3725b18780a [2]: https://lore.kernel.org/linux-mm/80926b22-a8d2-9992-eb5e-27e2c99cf460@google.com/T/#mb36c6526486638e82518c0f37a428fb279c84d8a [3]: https://lists.infradead.org/pipermail/maple-tree/2023-December/003098.html This patch (of 17): Add a store_type enum that is stored in ma_state. This will be used to keep track of partial walks of the tree so that subsequent walks can pick up where a previous walk left off. Link: https://lkml.kernel.org/r/20240814161944.55347-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20240814161944.55347-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: kmem: add lockdep assertion to obj_cgroup_memcgMuchun Song
obj_cgroup_memcg() is supposed to safe to prevent the returned memory cgroup from being freed only when the caller is holding the rcu read lock or objcg_lock or cgroup_mutex. It is very easy to ignore thoes conditions when users call some upper APIs which call obj_cgroup_memcg() internally like mem_cgroup_from_slab_obj() (See the link below). So it is better to add lockdep assertion to obj_cgroup_memcg() to find those issues ASAP. Because there is no user of obj_cgroup_memcg() holding objcg_lock to make the returned memory cgroup safe, do not add objcg_lock assertion (We should export objcg_lock if we really want to do). Additionally, this is some internal implementation detail of memcg and should not be accessible outside memcg code. Some users like __mem_cgroup_uncharge() do not care the lifetime of the returned memory cgroup, which just want to know if the folio is charged to a memory cgroup, therefore, they do not need to hold the needed locks. In which case, introduce a new helper folio_memcg_charged() to do this. Compare it to folio_memcg(), it could eliminate a memory access of objcg->memcg for kmem, actually, a really small gain. [songmuchun@bytedance.com: fix split_page_memcg()] Link: https://lkml.kernel.org/r/20240819080415.44964-1-songmuchun@bytedance.com Link: https://lore.kernel.org/all/20240718083607.42068-1-songmuchun@bytedance.com/ Link: https://lkml.kernel.org/r/20240814093415.17634-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: remove legacy install_special_mapping() codeLinus Torvalds
All relevant architectures had already been converted to the new interface (which just has an underscore in front of the name - not very imaginative naming), this just force-converts the stragglers. The modern interface is almost identical to the old one, except instead of the page pointer it takes a "struct vm_special_mapping" that describes the mapping (and contains the page pointer as one member), and it returns the resulting 'vma' instead of just the error code. Getting rid of the old interface also gets rid of some special casing, which had caused problems with the mremap extensions to "struct vm_special_mapping". [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/CAHk-=whvR+z=0=0gzgdfUiK70JTa-=+9vxD-4T=3BagXR6dciA@mail.gmail.comTested-by: Rob Landley <rob@landley.net> # arch/sh/ Link: https://lore.kernel.org/all/20240819195120.GA1113263@thelio-3990X/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Brian Cain <bcain@quicinc.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Jeff Xu <jeffxu@google.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Landley <rob@landley.net> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: remove arch_unmap()Michael Ellerman
Now that powerpc no longer uses arch_unmap() to handle VDSO unmapping, there are no meaningful implementions left. Drop support for it entirely, and update comments which refer to it. Link: https://lkml.kernel.org/r/20240812082605.743814-3-mpe@ellerman.id.au Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Jeff Xu <jeffxu@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: add optional close() to struct vm_special_mappingMichael Ellerman
Add an optional close() callback to struct vm_special_mapping. It will be used, by powerpc at least, to handle unmapping of the VDSO. Although support for unmapping the VDSO was initially added for CRIU[1], it is not desirable to guard that support behind CONFIG_CHECKPOINT_RESTORE. There are other known users of unmapping the VDSO which are not related to CRIU, eg. Valgrind [2] and void-ship [3]. The powerpc arch_unmap() hook has been in place for ~9 years, with no ifdef, so there may be other unknown users that have come to rely on unmapping the VDSO. Even if the code was behind an ifdef, major distros enable CHECKPOINT_RESTORE so users may not realise unmapping the VDSO depends on that configuration option. It's also undesirable to have such core mm behaviour behind a relatively obscure CONFIG option. Longer term the unmap behaviour should be standardised across architectures, however that is complicated by the fact the VDSO pointer is stored differently across architectures. There was a previous attempt to unify that handling [4], which could be revived. See [5] for further discussion. [1]: commit 83d3f0e90c6c ("powerpc/mm: tracking vDSO remap") [2]: https://sourceware.org/git/?p=valgrind.git;a=commit;h=3a004915a2cbdcdebafc1612427576bf3321eef5 [3]: https://github.com/insanitybit/void-ship [4]: https://lore.kernel.org/lkml/20210611180242.711399-17-dima@arista.com/ [5]: https://lore.kernel.org/linuxppc-dev/shiq5v3jrmyi6ncwke7wgl76ojysgbhrchsk32q4lbx2hadqqc@kzyy2igem256 Link: https://lkml.kernel.org/r/20240812082605.743814-1-mpe@ellerman.id.au Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Jeff Xu <jeffxu@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01filemap: add trace events for get_pages, map_pages, and faultTakaya Saeki
To allow precise tracking of page caches accessed, add new tracepoints that trigger when a process actually accesses them. The ureadahead program used by ChromeOS traces the disk access of programs as they start up at boot up. It uses mincore(2) or the 'mm_filemap_add_to_page_cache' trace event to accomplish this. It stores this information in a "pack" file and on subsequent boots, it will read the pack file and call readahead(2) on the information so that disk storage can be loaded into RAM before the applications actually need it. A problem we see is that due to the kernel's readahead algorithm that can aggressively pull in more data than needed (to try and accomplish the same goal) and this data is also recorded. The end result is that the pack file contains a lot of pages on disk that are never actually used. Calling readahead(2) on these unused pages can slow down the system boot up times. To solve this, add 3 new trace events, get_pages, map_pages, and fault. These will be used to trace the pages are not only pulled in from disk, but are actually used by the application. Only those pages will be stored in the pack file, and this helps out the performance of boot up. With the combination of these 3 new trace events and mm_filemap_add_to_page_cache, we observed a reduction in the pack file by 7.3% - 20% on ChromeOS varying by device. Link: https://lkml.kernel.org/r/20240813100312.3930505-1-takayas@chromium.org Signed-off-by: Takaya Saeki <takayas@chromium.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Junichi Uekawa <uekawa@chromium.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm/mprotect: fix dax pud handlingsPeter Xu
This is only relevant to the two archs that support PUD dax, aka, x86_64 and ppc64. PUD THPs do not yet exist elsewhere, and hugetlb PUDs do not count in this case. DAX have had PUD mappings for years, but change protection path never worked. When the path is triggered in any form (a simple test program would be: call mprotect() on a 1G dev_dax mapping), the kernel will report "bad pud". This patch should fix that. The new change_huge_pud() tries to keep everything simple. For example, it doesn't optimize write bit as that will need even more PUD helpers. It's not too bad anyway to have one more write fault in the worst case once for 1G range; may be a bigger thing for each PAGE_SIZE, though. Neither does it support userfault-wp bits, as there isn't such PUD mappings that is supported; file mappings always need a split there. The same to TLB shootdown: the pmd path (which was for x86 only) has the trick of using _ad() version of pmdp_invalidate*() which can avoid one redundant TLB, but let's also leave that for later. Again, the larger the mapping, the smaller of such effect. There's some difference on handling "retry" for change_huge_pud() (where it can return 0): it isn't like change_huge_pmd(), as the pmd version is safe with all conditions handled in change_pte_range() later, thanks to Hugh's new pte_offset_map_lock(). In short, change_pte_range() is simply smarter. For that, change_pud_range() will need proper retry if it races with something else when a huge PUD changed from under us. The last thing to mention is currently the PUD path ignores the huge pte numa counter (NUMA_HUGE_PTE_UPDATES), not only because DAX is not applicable to NUMA, but also that it's ambiguous on its own to decide how to account pud in this case. In one earlier version of this patchset I proposed to remove the counter as it doesn't even look right to do the accounting as of now [1], but then a further discussion suggests we can leave that for later, as that doesn't block this series if we choose to ignore that counter. That's what this patch does, by ignoring it. When at it, touch up the comment in pgtable_split_needed() to make it generic to either pmd or pud file THPs. [1] https://lore.kernel.org/all/20240715192142.3241557-3-peterx@redhat.com/ [2] https://lore.kernel.org/r/added2d0-b8be-4108-82ca-1367a388d0b1@redhat.com Link: https://lkml.kernel.org/r/20240812181225.1360970-8-peterx@redhat.com Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages") Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage") Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm/x86: implement arch_check_zapped_pud()Peter Xu
Introduce arch_check_zapped_pud() to sanity check shadow stack on PUD zaps. It has the same logic as the PMD helper. One thing to mention is, it might be a good idea to use page_table_check in the future for trapping wrong setups of shadow stack pgtable entries [1]. That is left for the future as a separate effort. [1] https://lore.kernel.org/all/59d518698f664e07c036a5098833d7b56b953305.camel@intel.com Link: https://lkml.kernel.org/r/20240812181225.1360970-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: rework accept memory helpersKirill A. Shutemov
Make accept_memory() and range_contains_unaccepted_memory() take 'start' and 'size' arguments instead of 'start' and 'end'. Remove accept_page(), replacing it with direct calls to accept_memory(). The accept_page() name is going to be used for a different function. Link: https://lkml.kernel.org/r/20240809114854.3745464-6-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: introduce PageUnaccepted() page typeKirill A. Shutemov
The new page type allows physical memory scanners to detect unaccepted memory and handle it accordingly. The page type is serialized with zone lock. Link: https://lkml.kernel.org/r/20240809114854.3745464-5-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: remove PG_errorMatthew Wilcox (Oracle)
The PG_error bit is now unused; delete it and free up a bit in page->flags. Link: https://lkml.kernel.org/r/20240807193528.1865100-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01percpu: remove pcpu_alloc_size()Jianhui Zhou
pcpu_alloc_size() was added in 7ac5c53e0073 "mm/percpu.c: introduce pcpu_alloc_size()", which is used to get the allocated memory size in bpf. However, pcpu_alloc_size() is no longer used in "bpf: Use c->unit_size to select target cache during free" because its actuall allocated memory size may change at runtime due to its slab merging mechanism. Therefore, pcpu_alloc_size() can be removed. Link: https://lkml.kernel.org/r/tencent_AD5C50E8D78C07A3CE539BD5F6BF39706507@qq.com Signed-off-by: Jianhui Zhou <912460177@qq.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: JonasZhou <JonasZhou@zhaoxin.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01kfence: introduce burst modeMarco Elver
Introduce burst mode, which can be configured with kfence.burst=$count, where the burst count denotes the additional successive slab allocations to be allocated through KFENCE for each sample interval. The idea is that this can give developers an additional knob to make KFENCE more aggressive when debugging specific issues of systems where either rebooting or recompiling the kernel with KASAN is not possible. Experiment: To assess the effectiveness of the new option, we randomly picked a recent out-of-bounds [1] and use-after-free bug [2], each with a reproducer provided by syzbot, that initially detected these bugs with KASAN. We then tried to reproduce the bugs with KFENCE below. [1] Fixed by: 7c55b78818cf ("jfs: xattr: fix buffer overflow for invalid xattr") https://syzkaller.appspot.com/bug?id=9d1b59d4718239da6f6069d3891863c25f9f24a2 [2] Fixed by: f8ad00f3fb2a ("l2tp: fix possible UAF when cleaning up tunnels") https://syzkaller.appspot.com/bug?id=4f34adc84f4a3b080187c390eeef60611fd450e1 The following KFENCE configs were compared. A pool size of 1023 objects was used for all configurations. Baseline kfence.sample_interval=100 kfence.skip_covered_thresh=75 kfence.burst=0 Aggressive kfence.sample_interval=1 kfence.skip_covered_thresh=10 kfence.burst=0 AggressiveBurst kfence.sample_interval=1 kfence.skip_covered_thresh=10 kfence.burst=1000 Each reproducer was run 10 times (after a fresh reboot), with the following detection counts for each KFENCE config: | Detection Count out of 10 | | OOB [1] | UAF [2] | ------------------+-------------+-------------+ Default | 0/10 | 0/10 | Aggressive | 0/10 | 0/10 | AggressiveBurst | 8/10 | 8/10 | With the Default and even the Aggressive configs the results are unsurprising, given KFENCE has not been designed for deterministic bug detection of small test cases. However, when enabling burst mode with relatively large burst count, KFENCE can start to detect heap memory-safety bugs even in simpler test cases with high probability (in the above cases with ~80% probability). Link: https://lkml.kernel.org/r/20240805124203.2692278-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: fix (harmless) type confusion in lock_vma_under_rcu()Jann Horn
There is a (harmless) type confusion in lock_vma_under_rcu(): After vma_start_read(), we have taken the VMA lock but don't know yet whether the VMA has already been detached and scheduled for RCU freeing. At this point, ->vm_start and ->vm_end are accessed. vm_area_struct contains a union such that ->vm_rcu uses the same memory as ->vm_start and ->vm_end; so accessing ->vm_start and ->vm_end of a detached VMA is illegal and leads to type confusion between union members. Fix it by reordering the vma->detached check above the address checks, and document the rules for RCU readers accessing VMAs. This will probably change the number of observed VMA_LOCK_MISS events (since previously, trying to access a detached VMA whose ->vm_rcu has been scheduled would bail out when checking the fault address against the rcu_head members reinterpreted as VMA bounds). Link: https://lkml.kernel.org/r/20240805-fix-vma-lock-type-confusion-v1-1-9f25443a9a71@google.com Fixes: 50ee32537206 ("mm: introduce lock_vma_under_rcu to be used from arch-specific code") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01zswap: implement a second chance algorithm for dynamic zswap shrinkerNhat Pham
Patch series "improving dynamic zswap shrinker protection scheme", v3. When experimenting with the memory-pressure based (i.e "dynamic") zswap shrinker in production, we observed a sharp increase in the number of swapins, which led to performance regression. We were able to trace this regression to the following problems with the shrinker's warm pages protection scheme: 1. The protection decays way too rapidly, and the decaying is coupled with zswap stores, leading to anomalous patterns, in which a small batch of zswap stores effectively erase all the protection in place for the warmer pages in the zswap LRU. This observation has also been corroborated upstream by Takero Funaki (in [1]). 2. We inaccurately track the number of swapped in pages, missing the non-pivot pages that are part of the readahead window, while counting the pages that are found in the zswap pool. To alleviate these two issues, this patch series improve the dynamic zswap shrinker in the following manner: 1. Replace the protection size tracking scheme with a second chance algorithm. This new scheme removes the need for haphazard stats decaying, and automatically adjusts the pace of pages aging with memory pressure, and writeback rate with pool activities: slowing down when the pool is dominated with zswpouts, and speeding up when the pool is dominated with stale entries. 2. Fix the tracking of the number of swapins to take into account non-pivot pages in the readahead window. With these two changes in place, in a kernel-building benchmark without any cold data added, the number of swapins is reduced by 64.12%. This translate to a 10.32% reduction in build time. We also observe a 3% reduction in kernel CPU time. In another benchmark, with cold data added (to gauge the new algorithm's ability to offload cold data), the new second chance scheme outperforms the old protection scheme by around 0.7%, and actually written back around 21% more pages to backing swap device. So the new scheme is just as good, if not even better than the old scheme on this front as well. [1]: https://lore.kernel.org/linux-mm/CAPpodddcGsK=0Xczfuk8usgZ47xeyf4ZjiofdT+ujiyz6V2pFQ@mail.gmail.com/ This patch (of 2): Current zswap shrinker's heuristics to prevent overshrinking is brittle and inaccurate, specifically in the way we decay the protection size (i.e making pages in the zswap LRU eligible for reclaim). We currently decay protection aggressively in zswap_lru_add() calls. This leads to the following unfortunate effect: when a new batch of pages enter zswap, the protection size rapidly decays to below 25% of the zswap LRU size, which is way too low. We have observed this effect in production, when experimenting with the zswap shrinker: the rate of shrinking shoots up massively right after a new batch of zswap stores. This is somewhat the opposite of what we want originally - when new pages enter zswap, we want to protect both these new pages AND the pages that are already protected in the zswap LRU. Replace existing heuristics with a second chance algorithm 1. When a new zswap entry is stored in the zswap pool, its referenced bit is set. 2. When the zswap shrinker encounters a zswap entry with the referenced bit set, give it a second chance - only flips the referenced bit and rotate it in the LRU. 3. If the shrinker encounters the entry again, this time with its referenced bit unset, then it can reclaim the entry. In this manner, the aging of the pages in the zswap LRUs are decoupled from zswap stores, and picks up the pace with increasing memory pressure (which is what we want). The second chance scheme allows us to modulate the writeback rate based on recent pool activities. Entries that recently entered the pool will be protected, so if the pool is dominated by such entries the writeback rate will reduce proportionally, protecting the workload's workingset.On the other hand, stale entries will be written back quickly, which increases the effective writeback rate. The referenced bit is added at the hole after the `length` field of struct zswap_entry, so there is no extra space overhead for this algorithm. We will still maintain the count of swapins, which is consumed and subtracted from the lru size in zswap_shrinker_count(), to further penalize past overshrinking that led to disk swapins. The idea is that had we considered this many more pages in the LRU active/protected, they would not have been written back and we would not have had to swapped them in. To test this new heuristics, I built the kernel under a cgroup with memory.max set to 2G, on a host with 36 cores: With the old shrinker: real: 263.89s user: 4318.11s sys: 673.29s swapins: 227300.5 With the second chance algorithm: real: 244.85s user: 4327.22s sys: 664.39s swapins: 94663 (average over 5 runs) We observe an 1.3% reduction in kernel CPU usage, and around 7.2% reduction in real time. Note that the number of swapped in pages dropped by 58%. [nphamcs@gmail.com: fix a small mistake in the referenced bit documentation] Link: https://lkml.kernel.org/r/20240806003403.3142387-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20240805232243.2896283-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20240805232243.2896283-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Takero Funaki <flintglass@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: remove follow_page()David Hildenbrand
All users are gone, let's remove it and any leftovers in comments. We'll leave any FOLL/follow_page_() naming cleanups as future work. Link: https://lkml.kernel.org/r/20240802155524.517137-11-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walkDavid Hildenbrand
Let's remove yet another follow_page() user. Note that we have to do the split without holding the PTL, after folio_walk_end(). We don't care about losing the secretmem check in follow_page(). [david@redhat.com: teach can_split_folio() that we are not holding an additional reference] Link: https://lkml.kernel.org/r/c75d1c6c-8ea6-424f-853c-1ccda6c77ba2@redhat.com Link: https://lkml.kernel.org/r/20240802155524.517137-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm/pagewalk: introduce folio_walk_start() + folio_walk_end()David Hildenbrand
We want to get rid of follow_page(), and have a more reasonable way to just lookup a folio mapped at a certain address, perform some checks while still under PTL, and then only conditionally grab a folio reference if really required. Further, we might want to get rid of some walk_page_range*() users that really only want to temporarily lookup a single folio at a single address. So let's add a new page table walker that does exactly that, similarly to GUP also being able to walk hugetlb VMAs. Add folio_walk_end() as a macro for now: the compiler is not easy to please with the pte_unmap()->kunmap_local(). Note that one difference between follow_page() and get_user_pages(1) is that follow_page() will not trigger faults to get something mapped. So folio_walk is at least currently not a replacement for get_user_pages(1), but could likely be extended/reused to achieve something similar in the future. Link: https://lkml.kernel.org/r/20240802155524.517137-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01include/linux/mmzone.h: clean up watermark accessorsAndrew Morton
- we have a helper wmark_pages(). Teach min_wmark_pages(), low_wmark_pages(), high_wmark_pages() and promo_wmark_pages() to use it instead of open-coding its implementation. - there's no reason to implement all these things as macros. Redo them in C. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: create promo_wmark_pages and clean up open-coded sitesKaiyang Zhao
Patch series "mm: print the promo watermark in zoneinfo", v2. This patch (of 2): Define promo_wmark_pages and convert current call sites of wmark_pages with fixed WMARK_PROMO to using it instead. Link: https://lkml.kernel.org/r/20240801232548.36604-1-kaiyang2@cs.cmu.edu Link: https://lkml.kernel.org/r/20240801232548.36604-2-kaiyang2@cs.cmu.edu Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: clarify folio_likely_mapped_shared() documentation for KSM foliosDavid Hildenbrand
For KSM folios, the function actually does what it is supposed to do: even having multiple mappings inside the same MM is considered "sharing", as there is no real relationship between these KSM page mappings -- in contrast to mapping the same file range twice and having the same pagecache page mapped twice. Link: https://lkml.kernel.org/r/20240731160758.808925-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm/hugetlb: remove hugetlb_follow_page_mask() leftoverDavid Hildenbrand
We removed hugetlb_follow_page_mask() in commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code") but forgot to cleanup some leftovers. While at it, simplify the hugetlb comment, it's overly detailed and rather confusing. Stating that we may end up in there during coredumping is sufficient to explain the PF_DUMPCORE usage. Link: https://lkml.kernel.org/r/20240731142000.625044-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: swap: add nr argument in swapcache_prepare and swapcache_clear to ↵Barry Song
support large folios Right now, swapcache_prepare() and swapcache_clear() supports one entry only, to support large folios, we need to handle multiple swap entries. To optimize stack usage, we iterate twice in __swap_duplicate(): the first time to verify that all entries are valid, and the second time to apply the modifications to the entries. Currently, we're using nr=1 for the existing users. [v-songbaohua@oppo.com: clarify swap_count_continued and improve readability for __swap_duplicate] Link: https://lkml.kernel.org/r/20240802071817.47081-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240730071339.107447-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: improve code consistency with zonelist_* helper functionsWei Yang
Replace direct access to zoneref->zone, zoneref->zone_idx, or zone_to_nid(zoneref->zone) with the corresponding zonelist_* helper functions for consistency. No functional change. Link: https://lkml.kernel.org/r/20240729091717.464-1-shivankg@amd.com Co-developed-by: Shivank Garg <shivankg@amd.com> Signed-off-by: Shivank Garg <shivankg@amd.com> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: move internal core VMA manipulation functions to own fileLorenzo Stoakes
This patch introduces vma.c and moves internal core VMA manipulation functions to this file from mmap.c. This allows us to isolate VMA functionality in a single place such that we can create userspace testing code that invokes this functionality in an environment where we can implement simple unit tests of core functionality. This patch ensures that core VMA functionality is explicitly marked as such by its presence in mm/vma.h. It also places the header includes required by vma.c in vma_internal.h, which is simply imported by vma.c. This makes the VMA functionality testable, as userland testing code can simply stub out functionality as required. Link: https://lkml.kernel.org/r/c77a6aafb4c42aaadb8e7271a853658cbdca2e22.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Gow <davidgow@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rae Moar <rmoar@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: move vma_shrink(), vma_expand() to internal headerLorenzo Stoakes
The vma_shrink() and vma_expand() functions are internal VMA manipulation functions which we ought to abstract for use outside of memory management code. To achieve this, we replace shift_arg_pages() in fs/exec.c with an invocation of a new relocate_vma_down() function implemented in mm/mmap.c, which enables us to also move move_page_tables() and vma_iter_prev_range() to internal.h. The purpose of doing this is to isolate key VMA manipulation functions in order that we can both abstract them and later render them easily testable. Link: https://lkml.kernel.org/r/3cfcd9ec433e032a85f636fdc0d7d98fafbd19c5.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Gow <davidgow@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rae Moar <rmoar@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: move vma_modify() and helpers to internal headerLorenzo Stoakes
These are core VMA manipulation functions which invoke VMA splitting and merging and should not be directly accessed from outside of mm/. Link: https://lkml.kernel.org/r/5efde0c6342a8860d5ffc90b415f3989fd8ed0b2.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Gow <davidgow@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rae Moar <rmoar@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01userfaultfd: move core VMA manipulation logic to mm/userfaultfd.cLorenzo Stoakes
Patch series "Make core VMA operations internal and testable", v4. There are a number of "core" VMA manipulation functions implemented in mm/mmap.c, notably those concerning VMA merging, splitting, modifying, expanding and shrinking, which logically don't belong there. More importantly this functionality represents an internal implementation detail of memory management and should not be exposed outside of mm/ itself. This patch series isolates core VMA manipulation functionality into its own file, mm/vma.c, and provides an API to the rest of the mm code in mm/vma.h. Importantly, it also carefully implements mm/vma_internal.h, which specifies which headers need to be imported by vma.c, leading to the very useful property that vma.c depends only on mm/vma.h and mm/vma_internal.h. This means we can then re-implement vma_internal.h in userland, adding shims for kernel mechanisms as required, allowing us to unit test internal VMA functionality. This testing is useful as opposed to an e.g. kunit implementation as this way we can avoid all external kernel side-effects while testing, run tests VERY quickly, and iterate on and debug problems quickly. Excitingly this opens the door to, in the future, recreating precise problems observed in production in userland and very quickly debugging problems that might otherwise be very difficult to reproduce. This patch series takes advantage of existing shim logic and full userland maple tree support contained in tools/testing/radix-tree/ and tools/include/linux/, separating out shared components of the radix tree implementation to provide this testing. Kernel functionality is stubbed and shimmed as needed in tools/testing/vma/ which contains a fully functional userland vma_internal.h file and which imports mm/vma.c and mm/vma.h to be directly tested from userland. A simple, skeleton testing implementation is provided in tools/testing/vma/vma.c as a proof-of-concept, asserting that simple VMA merge, modify (testing split), expand and shrink functionality work correctly. This patch (of 4): This patch forms part of a patch series intending to separate out VMA logic and render it testable from userspace, which requires that core manipulation functions be exposed in an mm/-internal header file. In order to do this, we must abstract APIs we wish to test, in this instance functions which ultimately invoke vma_modify(). This patch therefore moves all logic which ultimately invokes vma_modify() to mm/userfaultfd.c, trying to transfer code at a functional granularity where possible. [lorenzo.stoakes@oracle.com: fix user-after-free in userfaultfd_clear_vma()] Link: https://lkml.kernel.org/r/3c947ddc-b804-49b7-8fe9-3ea3ca13def5@lucifer.local Link: https://lkml.kernel.org/r/cover.1722251717.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/50c3ed995fd81c45876c86304c8a00bf3e396cfd.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Gow <davidgow@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rae Moar <rmoar@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm, memcg: cg2 memory{.swap,}.peak write handlersDavid Finkel
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7. This patch (of 2): Other mechanisms for querying the peak memory usage of either a process or v1 memory cgroup allow for resetting the high watermark. Restore parity with those mechanisms, but with a less racy API. For example: - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets the high watermark. - writing "5" to the clear_refs pseudo-file in a processes's proc directory resets the peak RSS. This change is an evolution of a previous patch, which mostly copied the cgroup v1 behavior, however, there were concerns about races/ownership issues with a global reset, so instead this change makes the reset filedescriptor-local. Writing any non-empty string to the memory.peak and memory.swap.peak pseudo-files reset the high watermark to the current usage for subsequent reads through that same FD. Notably, following Johannes's suggestion, this implementation moves the O(FDs that have written) behavior onto the FD write(2) path. Instead, on the page-allocation path, we simply add one additional watermark to conditionally bump per-hierarchy level in the page-counter. Additionally, this takes Longman's suggestion of nesting the page-charging-path checks for the two watermarks to reduce the number of common-case comparisons. This behavior is particularly useful for work scheduling systems that need to track memory usage of worker processes/cgroups per-work-item. Since memory can't be squeezed like CPU can (the OOM-killer has opinions), these systems need to track the peak memory usage to compute system/container fullness when binpacking workitems. Most notably, Vimeo's use-case involves a system that's doing global binpacking across many Kubernetes pods/containers, and while we can use PSI for some local decisions about overload, we strive to avoid packing workloads too tightly in the first place. To facilitate this, we track the peak memory usage. However, since we run with long-lived workers (to amortize startup costs) we need a way to track the high watermark while a work-item is executing. Polling runs the risk of missing short spikes that last for timescales below the polling interval, and peak memory tracking at the cgroup level is otherwise perfect for this use-case. As this data is used to ensure that binpacked work ends up with sufficient headroom, this use-case mostly avoids the inaccuracies surrounding reclaimable memory. Link: https://lkml.kernel.org/r/20240730231304.761942-1-davidf@vimeo.com Link: https://lkml.kernel.org/r/20240729143743.34236-1-davidf@vimeo.com Link: https://lkml.kernel.org/r/20240729143743.34236-2-davidf@vimeo.com Signed-off-by: David Finkel <davidf@vimeo.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Waiman Long <longman@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01s390/uv: drop arch_make_page_accessible()David Hildenbrand
All code was converted to using arch_make_folio_accessible(), let's drop arch_make_page_accessible(). Link: https://lkml.kernel.org/r/20240729183844.388481-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: simplify arch_make_folio_accessible()David Hildenbrand
Patch series "mm: remove arch_make_page_accessible()". Now that s390x implements arch_make_folio_accessible(), let's convert remaining users to use arch_make_folio_accessible() instead so we can remove arch_make_page_accessible(). This patch (of 3): Now that s390x implements HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE, let's turn generic arch_make_folio_accessible() into a NOP: there are no other targets that implement HAVE_ARCH_MAKE_PAGE_ACCESSIBLE but not HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE. Link: https://lkml.kernel.org/r/20240729183844.388481-1-david@redhat.com Link: https://lkml.kernel.org/r/20240729183844.388481-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm/hugetlb: enforce that PMD PT sharing has split PMD PT locksDavid Hildenbrand
Sharing page tables between processes but falling back to per-MM page table locks cannot possibly work. So, let's make sure that we do have split PMD locks by adding a new Kconfig option and letting that depend on CONFIG_SPLIT_PMD_PTLOCKS. Link: https://lkml.kernel.org/r/20240726150728.3159964-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: turn USE_SPLIT_PTE_PTLOCKS / USE_SPLIT_PTE_PTLOCKS into Kconfig optionsDavid Hildenbrand
Patch series "mm: split PTE/PMD PT table Kconfig cleanups+clarifications". This series is a follow up to the fixes: "[PATCH v1 0/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking" When working on the fixes, I wondered why 8xx is fine (-> never uses split PT locks) and how PT locking even works properly with PMD page table sharing (-> always requires split PMD PT locks). Let's improve the split PT lock detection, make hugetlb properly depend on it and make 8xx bail out if it would ever get enabled by accident. As an alternative to patch #3 we could extend the Kconfig SPLIT_PTE_PTLOCKS option from patch #2 -- but enforcing it closer to the code that actually implements it feels a bit nicer for documentation purposes, and there is no need to actually disable it because it should always be disabled (!SMP). Did a bunch of cross-compilations to make sure that split PTE/PMD PT locks are still getting used where we would expect them. [1] https://lkml.kernel.org/r/20240725183955.2268884-1-david@redhat.com This patch (of 3): Let's clean that up a bit and prepare for depending on CONFIG_SPLIT_PMD_PTLOCKS in other Kconfig options. More cleanups would be reasonable (like the arch-specific "depends on" for CONFIG_SPLIT_PTE_PTLOCKS), but we'll leave that for another day. Link: https://lkml.kernel.org/r/20240726150728.3159964-1-david@redhat.com Link: https://lkml.kernel.org/r/20240726150728.3159964-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: page_counters: initialize usage using ATOMIC_LONG_INIT() macroRoman Gushchin
When a page_counter structure is initialized, there is no need to use an atomic set operation to initialize the usage counter because at this point the structure is not visible to anybody else. ATOMIC_LONG_INIT() is what should be used in such cases. Link: https://lkml.kernel.org/r/20240726203110.1577216-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: page_counters: put page_counter_calculate_protection() under CONFIG_MEMCGRoman Gushchin
Put page_counter_calculate_protection() under CONFIG_MEMCG. The protection functionality (min/low limits) is not supported by any other cgroup subsystem, so page_counter_calculate_protection() and related static effective_protection() can be compiled out if CONFIG_MEMCG is not enabled. Link: https://lkml.kernel.org/r/20240726203110.1577216-3-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>