summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-02-22mm/mempolicy: implement the sysfs-based weighted_interleave interfaceRakie Kim
Patch series "mm/mempolicy: weighted interleave mempolicy and sysfs extension", v5. Weighted interleave is a new interleave policy intended to make use of heterogeneous memory environments appearing with CXL. The existing interleave mechanism does an even round-robin distribution of memory across all nodes in a nodemask, while weighted interleave distributes memory across nodes according to a provided weight. (Weight = # of page allocations per round) Weighted interleave is intended to reduce average latency when bandwidth is pressured - therefore increasing total throughput. In other words: It allows greater use of the total available bandwidth in a heterogeneous hardware environment (different hardware provides different bandwidth capacity). As bandwidth is pressured, latency increases - first linearly and then exponentially. By keeping bandwidth usage distributed according to available bandwidth, we therefore can reduce the average latency of a cacheline fetch. A good explanation of the bandwidth vs latency response curve: https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/ From the article: ``` Constant region: The latency response is fairly constant for the first 40% of the sustained bandwidth. Linear region: In between 40% to 80% of the sustained bandwidth, the latency response increases almost linearly with the bandwidth demand of the system due to contention overhead by numerous memory requests. Exponential region: Between 80% to 100% of the sustained bandwidth, the memory latency is dominated by the contention latency which can be as much as twice the idle latency or more. Maximum sustained bandwidth : Is 65% to 75% of the theoretical maximum bandwidth. ``` As a general rule of thumb: * If bandwidth usage is low, latency does not increase. It is optimal to place data in the nearest (lowest latency) device. * If bandwidth usage is high, latency increases. It is optimal to place data such that bandwidth use is optimized per-device. This is the top line goal: Provide a user a mechanism to target using the "maximum sustained bandwidth" of each hardware component in a heterogenous memory system. For example, the stream benchmark demonstrates that 1:1 (default) interleave is actively harmful, while weighted interleave can be beneficial. Default interleave distributes data such that too much pressure is placed on devices with lower available bandwidth. Stream Benchmark (vs DRAM, 1 Socket + 1 CXL Device) Default interleave : -78% (slower than DRAM) Global weighting : -6% to +4% (workload dependant) Targeted weights : +2.5% to +4% (consistently better than DRAM) Global means the task-policy was set (set_mempolicy), while targeted means VMA policies were set (mbind2). We see weighted interleave is not always beneficial when applied globally, but is always beneficial when applied to bandwidth-driving memory regions. There are 4 patches in this set: 1) Implement system-global interleave weights as sysfs extension in mm/mempolicy.c. These weights are RCU protected, and a default weight set is provided (all weights are 1 by default). In future work, we intend to expose an interface for HMAT/CDAT code to set reasonable default values based on the memory configuration of the system discovered at boot/hotplug. 2) A mild refactor of some interleave-logic for re-use in the new weighted interleave logic. 3) MPOL_WEIGHTED_INTERLEAVE extension for set_mempolicy/mbind 4) Protect interleave logic (weighted and normal) with the mems_allowed seq cookie. If the nodemask changes while accessing it during a rebind, just retry the access. Included below are some performance and LTP test information, and a sample numactl branch which can be used for testing. = Performance summary = (tests may have different configurations, see extended info below) 1) MLC (W2) : +38% over DRAM. +264% over default interleave. MLC (W5) : +40% over DRAM. +226% over default interleave. 2) Stream : -6% to +4% over DRAM, +430% over default interleave. 3) XSBench : +19% over DRAM. +47% over default interleave. = LTP Testing Summary = existing mempolicy & mbind tests: pass mempolicy & mbind + weighted interleave (global weights): pass = version history v5: - style fixes - mems_allowed cookie protection to detect rebind issues, prevents spurious allocation failures and/or mis-allocations - sparse warning fixes related to __rcu on local variables ===================================================================== Performance tests - MLC From - Ravi Jonnalagadda <ravis.opensrc@micron.com> Hardware: Single-socket, multiple CXL memory expanders. Workload: W2 Data Signature: 2:1 read:write DRAM only bandwidth (GBps): 298.8 DRAM + CXL (default interleave) (GBps): 113.04 DRAM + CXL (weighted interleave)(GBps): 412.5 Gain over DRAM only: 1.38x Gain over default interleave: 2.64x Workload: W5 Data Signature: 1:1 read:write DRAM only bandwidth (GBps): 273.2 DRAM + CXL (default interleave) (GBps): 117.23 DRAM + CXL (weighted interleave)(GBps): 382.7 Gain over DRAM only: 1.4x Gain over default interleave: 2.26x ===================================================================== Performance test - Stream From - Gregory Price <gregory.price@memverge.com> Hardware: Single socket, single CXL expander numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times Default interleave : -78% (slower than DRAM) Global weighting : -6% to +4% (workload dependant) mbind2 weights : +2.5% to +4% (consistently better than DRAM) dram only: numactl --cpunodebind=1 --membind=1 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Function Direction BestRateMBs AvgTime MinTime MaxTime Copy: 0->0 200923.2 0.032662 0.031853 0.033301 Scale: 0->0 202123.0 0.032526 0.031664 0.032970 Add: 0->0 208873.2 0.047322 0.045961 0.047884 Triad: 0->0 208523.8 0.047262 0.046038 0.048414 CXL-only: numactl --cpunodebind=1 -w --membind=2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Copy: 0->0 22209.7 0.288661 0.288162 0.289342 Scale: 0->0 22288.2 0.287549 0.287147 0.288291 Add: 0->0 24419.1 0.393372 0.393135 0.393735 Triad: 0->0 24484.6 0.392337 0.392083 0.394331 Based on the above, the optimal weights are ~9:1 echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1 echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2 default interleave: numactl --cpunodebind=1 --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Copy: 0->0 44666.2 0.143671 0.143285 0.144174 Scale: 0->0 44781.6 0.143256 0.142916 0.143713 Add: 0->0 48600.7 0.197719 0.197528 0.197858 Triad: 0->0 48727.5 0.197204 0.197014 0.197439 global weighted interleave: numactl --cpunodebind=1 -w --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Copy: 0->0 190085.9 0.034289 0.033669 0.034645 Scale: 0->0 207677.4 0.031909 0.030817 0.033061 Add: 0->0 202036.8 0.048737 0.047516 0.053409 Triad: 0->0 217671.5 0.045819 0.044103 0.046755 targted regions w/ global weights (modified stream to mbind2 malloc'd regions)) numactl --cpunodebind=1 --membind=1 ./stream_c.exe -b --ntimes 100 --array-size 400M --malloc Copy: 0->0 205827.0 0.031445 0.031094 0.031984 Scale: 0->0 208171.8 0.031320 0.030744 0.032505 Add: 0->0 217352.0 0.045087 0.044168 0.046515 Triad: 0->0 216884.8 0.045062 0.044263 0.046982 ===================================================================== Performance tests - XSBench From - Hyeongtak Ji <hyeongtak.ji@sk.com> Hardware: Single socket, Single CXL memory Expander NUMA node 0: 56 logical cores, 128 GB memory NUMA node 2: 96 GB CXL memory Threads: 56 Lookups: 170,000,000 Summary: +19% over DRAM. +47% over default interleave. Performance tests - XSBench 1. dram only $ numactl -m 0 ./XSBench -s XL –p 5000000 Runtime: 36.235 seconds Lookups/s: 4,691,618 2. default interleave $ numactl –i 0,2 ./XSBench –s XL –p 5000000 Runtime: 55.243 seconds Lookups/s: 3,077,293 3. weighted interleave numactl –w –i 0,2 ./XSBench –s XL –p 5000000 Runtime: 29.262 seconds Lookups/s: 5,809,513 ===================================================================== LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2 = Existing tests set_mempolicy, get_mempolicy, mbind MPOL_WEIGHTED_INTERLEAVE added manually to test basic functionality but did not adjust tests for weighting. Basically the weights were set to 1, which is the default, and it should behave the same as MPOL_INTERLEAVE if logic is correct. == set_mempolicy01 : passed 18, failed 0 == set_mempolicy02 : passed 10, failed 0 == set_mempolicy03 : passed 64, failed 0 == set_mempolicy04 : passed 32, failed 0 == set_mempolicy05 - n/a on non-x86 == set_mempolicy06 : passed 10, failed 0 this is set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE == set_mempolicy07 : passed 32, failed 0 set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE == get_mempolicy01 : passed 12, failed 0 change: added MPOL_WEIGHTED_INTERLEAVE == get_mempolicy02 : passed 2, failed 0 == mbind01 : passed 15, failed 0 added MPOL_WEIGHTED_INTERLEAVE == mbind02 : passed 4, failed 0 added MPOL_WEIGHTED_INTERLEAVE == mbind03 : passed 16, failed 0 added MPOL_WEIGHTED_INTERLEAVE == mbind04 : passed 48, failed 0 added MPOL_WEIGHTED_INTERLEAVE ===================================================================== numactl (set_mempolicy) w/ global weighting test numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master command: numactl -w --interleave=0,1 ./eatmem result (weights 1:1): 0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4 7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4 50% distribution is correct result (weights 5:1): 01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4 7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4 16.666% distribution is correct result (weights 1:5): 01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4 7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4 16.666% distribution is correct #include <stdio.h> #include <stdlib.h> #include <string.h> int main (void) { char* mem = malloc(1024*1024*256); memset(mem, 1, 1024*1024*256); for (int i = 0; i < ((1024*1024*256)/4096); i++) { mem = malloc(4096); mem[0] = 1; } printf("done\n"); getchar(); return 0; } This patch (of 4): This patch provides a way to set interleave weight information under sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN The sysfs structure is designed as follows. $ tree /sys/kernel/mm/mempolicy/ /sys/kernel/mm/mempolicy/ [1] └── weighted_interleave [2] ├── node0 [3] └── node1 Each file above can be explained as follows. [1] mm/mempolicy: configuration interface for mempolicy subsystem [2] weighted_interleave/: config interface for weighted interleave policy [3] weighted_interleave/nodeN: weight for nodeN If a node value is set to `0`, the system-default value will be used. As of this patch, the system-default for all nodes is always 1. Link: https://lkml.kernel.org/r/20240202170238.90004-1-gregory.price@memverge.com Link: https://lkml.kernel.org/r/20240202170238.90004-2-gregory.price@memverge.com Suggested-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Co-developed-by: Gregory Price <gregory.price@memverge.com> Signed-off-by: Gregory Price <gregory.price@memverge.com> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Gregory Price <gourry.memverge@gmail.com> Cc: Hasan Al Maruf <Hasan.Maruf@amd.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/mmap: use SZ_{8K, 128K} helper macroYajun Deng
Use SZ_{8K, 128K} helper macro instead of the number in init_user_reserve and reserve_mem_notifier. This is more readable. Link: https://lkml.kernel.org/r/20240131031913.2058597-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22Docs/translations/damon/usage: update for monitor_on renamingSeongJae Park
Update DAMON debugfs interface sections on the translated usage documents to reflect the fact that 'monitor_on' file has renamed to 'monitor_on_DEPRECATED'. Link: https://lkml.kernel.org/r/20240130013549.89538-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Alex Shi <alexs@kernel.org> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22Docs/admin-guide/mm/damon/usage: update for monitor_on renamingSeongJae Park
Update DAMON debugfs interface sections on the usage document to reflect the fact that 'monitor_on' file has renamed to 'monitor_on_DEPRECATED'. Link: https://lkml.kernel.org/r/20240130013549.89538-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/damon/dbgfs: rename monitor_on file to monitor_on_DEPRECATEDSeongJae Park
Kernel builders could silently enable CONFIG_DAMON_DBGFS_DEPRECATED. Users who manually check the files under the DAMON debugfs directory could notice the deprecation owing to the 'DEPRECATED' DAMON debugfs file, but there could be users who doesn't manually check the files. Make the deprecation cannot be ignored in the case by renaming 'monitor_on' file, which is essential for real use of DAMON on runtime, to 'monitor_on_DEPRECATED'. Still users who control DAMON via only user-space tool could ignore the deprecation, but that's what the tool developers should take care of. DAMON user-space tool, damo, has also made a change[1] for the purpose. [1] commit 935dae76f2aee ("_damon_args: Rename --damon_interface to --damon_interface_DEPRECATED") of https://github.com/awslabs/damo Link: https://lkml.kernel.org/r/20240130013549.89538-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22selftets/damon: prepare for monitor_on file renamingSeongJae Park
Following change will rename 'monitor_on' DAMON debugfs file to 'monitor_on_DEPRECATED', to make the deprecation unignorable in runtime. Since it could make DAMON selftests fail and disturb future bisects, update DAMON selftests to support the change. Link: https://lkml.kernel.org/r/20240130013549.89538-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22Docs/admin-guide/mm/damon/usage: document 'DEPRECATED' file of DAMON debugfs ↵SeongJae Park
interface Document the newly added DAMON debugfs interface deprecation notice file on the usage document. Link: https://lkml.kernel.org/r/20240130013549.89538-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/damon/dbgfs: make debugfs interface deprecation message a macroSeongJae Park
DAMON debugfs interface deprecation message is written twice, once for the warning, and again for DEPRECATED file's read output. De-duplicate those by defining the message as a macro and reuse. [akpm@linux-foundation.org: s/comnst/const/] Link: https://lkml.kernel.org/r/20240130013549.89538-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/damon/dbgfs: implement deprecation notice fileSeongJae Park
Implement a read-only file for DAMON debugfs interface deprecation notice, to let users who manually read/write the DAMON debugfs files from their shell command line easily notice the fact. [arnd@arndb.de: fix bogus string length] Link: https://lkml.kernel.org/r/20240202124339.892862-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20240130013549.89538-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Alex Shi <alexs@kernel.org> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <siyanteng@loongson.cn> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/damon: rename CONFIG_DAMON_DBGFS to DAMON_DBGFS_DEPRECATEDSeongJae Park
DAMON debugfs interface is deprecated. The fact has documented by commit 5445fcbc4cda ("Docs/admin-guide/mm/damon/usage: add DAMON debugfs interface deprecation notice"). Commit 620932cd2852 ("mm/damon/dbgfs: print DAMON debugfs interface deprecation message") further started printing a warning message when users still use it. Many people don't read documentation or kernel log, though. Make the deprecation harder to be ignored using the approach of commit eb07c4f39c3e ("mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED"). 'make oldconfig' with 'CONFIG_DAMON_DBGFS=y' will get a new prompt with the explicit deprecation notice on the name. 'make olddefconfig' with 'CONFIG_DAMON_DBGFS=y' will result in not building DAMON debugfs interface. If there is a real user of DAMON debugfs interface, they will complain the change to the builder. Link: https://lkml.kernel.org/r/20240130013549.89538-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22Docs/admin-guide/mm/damon/usage: use sysfs interface for tracepoints exampleSeongJae Park
Patch series "mm/damon: make DAMON debugfs interface deprecation unignorable". DAMON debugfs interface is deprecated in February 2023, by commit 5445fcbc4cda ("Docs/admin-guide/mm/damon/usage: add DAMON debugfs interface deprecation notice"). Make the fact unable to be easily ignored by removing an example usage from the document (patch 1), renaming the config (patch 2), adding a deprecation notice file to the debugfs directory (patches 3-5), and renaming the debugfs file that essnetial to be used for real use of DAMON (patches 6-9). This patch (of 9): DAMON tracepoints example on the DAMON usage document is using DAMON debugfs interface, which is deprecated. Use its alternative, DAMON sysfs interface. Link: https://lkml.kernel.org/r/20240130013549.89538-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240130013549.89538-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: function ordering: shrink_memcg_cbJohannes Weiner
shrink_memcg_cb() is called by the shrinker and is based on zswap_writeback_entry(). Move it in between. Save one fwd decl. Link: https://lkml.kernel.org/r/20240130014208.565554-21-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: function ordering: writebackJohannes Weiner
Shrinking needs writeback. Naturally, move the writeback code above the shrinking code. Delete the forward decl. Link: https://lkml.kernel.org/r/20240130014208.565554-20-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: function ordering: per-cpu compression infraJohannes Weiner
The per-cpu compression init/exit callbacks are awkwardly in the middle of the shrinker code. Move them up to the compression section. Link: https://lkml.kernel.org/r/20240130014208.565554-19-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: function ordering: compress & decompress functionsJohannes Weiner
Writeback needs to decompress. Move the (de)compression API above what will be the consolidated shrinking/writeback code. Link: https://lkml.kernel.org/r/20240130014208.565554-18-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: function ordering: move entry section out of tree sectionJohannes Weiner
The higher-level entry operations modify the tree, so move the entry API after the tree section. Link: https://lkml.kernel.org/r/20240130014208.565554-17-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: function ordering: move entry sections out of LRU sectionJohannes Weiner
This completes consolidation of the LRU section. Link: https://lkml.kernel.org/r/20240130014208.565554-16-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: function ordering: public lru apiJohannes Weiner
The zswap entry section sits awkwardly in the middle of LRU-related functions. Group the external LRU API functions first. Link: https://lkml.kernel.org/r/20240130014208.565554-15-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: function ordering: pool paramsJohannes Weiner
Patch series "mm: zswap: cleanups". Cleanups and maintenance items that accumulated while reviewing zswap patches. This patch (of 20): The parameters primarily control pool attributes. Move those operations up to the pool section. Link: https://lkml.kernel.org/r/20240130014208.565554-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20240130014208.565554-14-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: function ordering: zswap_poolsJohannes Weiner
Move the operations against the global zswap_pools list (current pool, last, find) to the pool section. Link: https://lkml.kernel.org/r/20240130014208.565554-13-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: function ordering: pool refcountingJohannes Weiner
Move pool refcounting functions into the pool section. First the destroy functions, then the get and put which uses them. __zswap_pool_empty() has an upward reference to the global zswap_pools, to sanity check it's not the currently active pool that's being freed. That gets the forward decl for zswap_pool_current(). This puts the get and put function above all callers, so kill the forward decls as well. Link: https://lkml.kernel.org/r/20240130014208.565554-12-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: function ordering: pool alloc & freeJohannes Weiner
The function ordering in zswap.c is a little chaotic, which requires jumping in unexpected directions when following related code. This is a series of patches that brings the file into the following order: - pool functions - lru functions - rbtree functions - zswap entry functions - compression/backend functions - writeback & shrinking functions - store, load, invalidate, swapon, swapoff - debugfs - init But it has to be split up such the moving still produces halfway readable diffs. In this patch, move pool allocation and freeing functions. Link: https://lkml.kernel.org/r/20240130014208.565554-11-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: simplify zswap_invalidate()Johannes Weiner
The branching is awkward and duplicates code. The comment about writeback is also misleading: yes, the entry might have been written back. Or it might have never been stored in zswap to begin with due to a rejection - zswap_invalidate() is called on all exiting swap entries. Link: https://lkml.kernel.org/r/20240130014208.565554-10-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: further cleanup zswap_store()Johannes Weiner
- Remove dupentry, reusing entry works just fine. - Rename pool to shrink_pool, as this one actually is confusing. - Remove page, use folio_nid() and kmap_local_folio() directly. - Set entry->swpentry in a common path. - Move value and src to local scope of use. Link: https://lkml.kernel.org/r/20240130014208.565554-9-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: break out zwap_compress()Johannes Weiner
zswap_store() is long and mixes work at the zswap layer with work at the backend and compression layer. Move compression & backend work to zswap_compress(), mirroring zswap_decompress(). Link: https://lkml.kernel.org/r/20240130014208.565554-8-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: rename __zswap_load() to zswap_decompress()Johannes Weiner
Link: https://lkml.kernel.org/r/20240130014208.565554-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: clean up zswap_entry_put()Johannes Weiner
Remove stale comment and unnecessary local variable. Link: https://lkml.kernel.org/r/20240130014208.565554-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: warn when referencing a dead entryJohannes Weiner
Put a standard sanity check on zswap_entry_get() for UAF scenario. Link: https://lkml.kernel.org/r/20240130014208.565554-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: move zswap_invalidate_entry() to related functionsJohannes Weiner
Move it up to the other tree and refcounting functions. Link: https://lkml.kernel.org/r/20240130014208.565554-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: inline and remove zswap_entry_find_get()Johannes Weiner
There is only one caller and the function is trivial. Inline it. Link: https://lkml.kernel.org/r/20240130014208.565554-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: rename zswap_free_entry to zswap_entry_freeJohannes Weiner
There is a zswap_entry_ namespace with multiple functions already. Link: https://lkml.kernel.org/r/20240130014208.565554-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/list_lru: remove list_lru_putback()Chengming Zhou
Since the only user zswap_lru_putback() has gone, remove list_lru_putback() too. Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-3-b10479847099@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Chris Li <chriscli@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/zswap: fix race between lru writeback and swapoffChengming Zhou
LRU writeback has race problem with swapoff, as spotted by Yosry [1]: CPU1 CPU2 shrink_memcg_cb swap_off list_lru_isolate zswap_invalidate zswap_swapoff kfree(tree) // UAF spin_lock(&tree->lock) The problem is that the entry in lru list can't protect the tree from being swapoff and freed, and the entry also can be invalidated and freed concurrently after we unlock the lru lock. We can fix it by moving the swap cache allocation ahead before referencing the tree, then check invalidate race with tree lock, only after that we can safely deref the entry. Note we couldn't deref entry or tree anymore after we unlock the folio, since we depend on this to hold on swapoff. So this patch moves all tree and entry usage to zswap_writeback_entry(), we only use the copied swpentry on the stack to allocate swap cache and if returned with folio locked we can reference the tree safely. Then we can check invalidate race with tree lock, the following things is much the same like zswap_load(). Since we can't deref the entry after zswap_writeback_entry(), we can't use zswap_lru_putback() anymore, instead we rotate the entry in the beginning. And it will be unlinked and freed when invalidated if writeback success. Another change is we don't update the memcg nr_zswap_protected in the -ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with swapin or concurrent shrinker action, since swapin already have memcg nr_zswap_protected updated, don't need double counts here. For concurrent shrinker, the folio will be writeback and freed anyway. -ENOMEM case is extremely rare and doesn't happen spuriously either, so don't bother distinguishing this case. [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chris Li <chriscli@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22x86/mm: clarify "prev" usage in switch_mm_irqs_off()Yosry Ahmed
In the x86 implementation of switch_mm_irqs_off(), we do not use the "prev" argument passed in by the caller, we use exclusively use "real_prev", which is cpu_tlbstate.loaded_mm. This is not obvious at the first sight. Furthermore, a comment describes a condition that happens when called with prev == next, but this should not affect the function in any way since prev is unused. Apparently, the comment is intended to clarify why we don't rely on prev == next to decide whether we need to update CR3, but again, it is not obvious. The comment also references the fact that leave_mm() calls with prev == NULL and tsk == NULL, but this also shouldn't matter because prev is unused and tsk is only used in one function which has a NULL check. Clarify things by renaming (prev -> unused) and (real_prev -> prev), also move and rewrite the comment as an explanation for why we don't rely on "prev" supplied by the caller in x86 code and use our own. Hopefully this makes reading the code easier. Link: https://lkml.kernel.org/r/20240126080644.1714297-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22x86/mm: delete unused cpu argument to leave_mm()Yosry Ahmed
The argument is unused since commit 3d28ebceaffa ("x86/mm: Rework lazy TLB to track the actual loaded mm"), delete it. Link: https://lkml.kernel.org/r/20240126080644.1714297-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm and cache_info: remove unnecessary CPU cache info updateHuang Ying
For each CPU hotplug event, we will update per-CPU data slice size and corresponding PCP configuration for every online CPU to make the implementation simple. But, Kyle reported that this takes tens seconds during boot on a machine with 34 zones and 3840 CPUs. So, in this patch, for each CPU hotplug event, we only update per-CPU data slice size and corresponding PCP configuration for the CPUs that share caches with the hotplugged CPU. With the patch, the system boot time reduces 67 seconds on the machine. Link: https://lkml.kernel.org/r/20240126081944.414520-1-ying.huang@intel.com Fixes: 362d37a106dd ("mm, pcp: reduce lock contention for draining high-order pages") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Originally-by: Kyle Meyer <kyle.meyer@hpe.com> Reported-and-tested-by: Kyle Meyer <kyle.meyer@hpe.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22kswapd: replace try_to_freeze() with kthread_freezable_should_stop()Levi Yun
Instead of using try_to_freeze, use kthread_freezable_should_stop in kswapd. By this, we can avoid unnecessary freezing when kswapd should stop. Link: https://lkml.kernel.org/r/20240126152556.58791-1-ppbuk5246@gmail.com Signed-off-by: Levi Yun <ppbuk5246@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: memcg: don't periodically flush stats when memcg is disabledT.J. Mercier
The root memcg is onlined even when memcg is disabled. When it's onlined a 2 second periodic stat flush is started, but no stat flushing is required when memcg is disabled because there can be no child memcgs. Most calls to flush memcg stats are avoided when memcg is disabled as a result of the mem_cgroup_disabled check added in 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"), but the periodic flushing started in mem_cgroup_css_online is not. Skip it. Link: https://lkml.kernel.org/r/20240126211927.1171338-1-tjmercier@google.com Fixes: aa48e47e3906 ("memcg: infrastructure to flush memcg stats") Signed-off-by: T.J. Mercier <tjmercier@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Li <chrisl@kernel.org> Reported-by: Minchan Kim <minchan@google.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Michal Koutn <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22selftests/mm: new test that steals pagesBreno Leitao
This test stresses the race between of madvise(DONTNEED), a page fault and a parallel huge page mmap, which should fail due to lack of available page available for mapping. This test case must run on a system with one and only one huge page available. # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages During setup, the test allocates the only available page, and starts three threads: - thread 1: * madvise(MADV_DONTNEED) on the allocated huge page - thread 2: * Write to the allocated huge page - thread 3: * Tries to allocated (steal) an extra huge page (which is not available) thread 3 should never succeed in the allocation, since the only huge page was never unmapped, and should be reserved. Touching the old page after thread3 allocation will raise a SIGBUS. Link: https://lkml.kernel.org/r/20240105155419.1939484-2-leitao@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: kmsan: remove runtime checks from kmsan_unpoison_memory()Alexander Potapenko
Similarly to what's been done in commit 85716a80c16d ("kmsan: allow using __msan_instrument_asm_store() inside runtime"), it should be safe to call kmsan_unpoison_memory() from within the runtime, as it does not allocate memory or take locks. Remove the redundant runtime checks. This should fix false positives seen with CONFIG_DEBUG_LIST=y when the non-instrumented lib/stackdepot.c failed to unpoison the memory chunks later checked by the instrumented lib/list_debug.c Also replace the implementation of kmsan_unpoison_entry_regs() with a call to kmsan_unpoison_memory(). Link: https://lkml.kernel.org/r/20240124173134.1165747-1-glider@google.com Fixes: f80be4571b19 ("kmsan: add KMSAN runtime core") Signed-off-by: Alexander Potapenko <glider@google.com> Tested-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22highmem: add kernel-doc for memcpy_*_folio()Matthew Wilcox (Oracle)
This was inadvertently skipped when adding the new functions. Link: https://lkml.kernel.org/r/20240124181217.1761674-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22dax: add a sysfs knob to control memmap_on_memory behaviorVishal Verma
Add a sysfs knob for dax devices to control the memmap_on_memory setting if the dax device were to be hotplugged as system memory. The default memmap_on_memory setting for dax devices originating via pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to preserve legacy behavior. For dax devices via CXL, the default is on. The sysfs control allows the administrator to override the above defaults if needed. Link: https://lkml.kernel.org/r/20240124-vv-dax_abi-v7-5-20d16cb8d23d@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Huang, Ying <ying.huang@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory_hotplug: export mhp_supports_memmap_on_memory()Vishal Verma
In preparation for adding sysfs ABI to toggle memmap_on_memory semantics for drivers adding memory, export the mhp_supports_memmap_on_memory() helper. This allows drivers to check if memmap_on_memory support is available before trying to request it, and display an appropriate message if it isn't available. As part of this, remove the size argument to this - with recent updates to allow memmap_on_memory for larger ranges, and the internal splitting of altmaps into respective memory blocks, the size argument is meaningless. [akpm@linux-foundation.org: fix build] Link: https://lkml.kernel.org/r/20240124-vv-dax_abi-v7-4-20d16cb8d23d@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22Documentatiion/ABI: add ABI documentation for sys-bus-daxVishal Verma
Add the missing sysfs ABI documentation for the device DAX subsystem. Various ABI attributes under this have been present since v5.1, and more have been added over time. In preparation for adding a new attribute, add this file with the historical details. Link: https://lkml.kernel.org/r/20240124-vv-dax_abi-v7-3-20d16cb8d23d@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22dax/bus.c: replace several sprintf() with sysfs_emit()Vishal Verma
There were several places where drivers/dax/bus.c uses 'sprintf' to print sysfs data. Since a sysfs_emit() helper is available specifically for this purpose, replace all the sprintf() usage for sysfs with sysfs_emit() in this file. Link: https://lkml.kernel.org/r/20240124-vv-dax_abi-v7-2-20d16cb8d23d@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22dax/bus.c: replace driver-core lock usage by a local rwsemVishal Verma
Patch series "Add DAX ABI for memmap_on_memory", v7. This series adds sysfs ABI to control memmap_on_memory behavior for DAX devices. Patch 1 replaces incorrect device_lock() usage with a local rwsem - this was identified during review. Patch 2 is also a preparatory patch that replaces sprintf() for sysfs operations with sysfs_emit() Patch 3 adds the missing documentation for the sysfs ABI for DAX regions and Dax devices. Patch 4 exports mhp_supports_memmap_on_memory(). Patch 5 adds the new ABI for toggling memmap_on_memory semantics for dax devices. This patch (of 5): The dax driver incorrectly used driver-core device locks to protect internal dax region and dax device configuration structures. Replace the device lock usage with a local rwsem, one each for dax region configuration and dax device configuration. As a result of this conversion, no device_lock() usage remains in dax/bus.c. Link: https://lkml.kernel.org/r/20240124-vv-dax_abi-v7-0-20d16cb8d23d@intel.com Link: https://lkml.kernel.org/r/20240124-vv-dax_abi-v7-1-20d16cb8d23d@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: remove unused tree argument in zswap_entry_put()Yosry Ahmed
Commit 7310895779624 ("mm: zswap: tighten up entry invalidation") removed the usage of tree argument, delete it. Link: https://lkml.kernel.org/r/20240125081423.1200336-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/mmap: introduce vma_set_range()Yajun Deng
There is a lot of code needs to set the range of vma in mmap.c, introduce vma_set_range() to simplify the code. Link: https://lkml.kernel.org/r/20240124035719.3685193-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: zswap: remove unnecessary trees cleanups in zswap_swapoff()Yosry Ahmed
During swapoff, try_to_unuse() makes sure that zswap_invalidate() is called for all swap entries before zswap_swapoff() is called. This means that all zswap entries should already be removed from the tree. Simplify zswap_swapoff() by removing the trees cleanup code, and leave an assertion in its place. Link: https://lkml.kernel.org/r/20240124045113.415378-3-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: swap: enforce updating inuse_pages at the end of swap_range_free()Yosry Ahmed
Patch series "mm: zswap: simplify zswap_swapoff()", v2. These patches aim to simplify zswap_swapoff() by removing the unnecessary trees cleanup code. Patch 1 makes sure that the order of operations during swapoff is enforced correctly, making sure the simplification in patch 2 is correct in a future-proof manner. This patch (of 2): In swap_range_free(), we update inuse_pages then do some cleanups (arch invalidation, zswap invalidation, swap cache cleanups, etc). During swapoff, try_to_unuse() checks that inuse_pages is 0 to make sure all swap entries are freed. Make sure we only update inuse_pages after we are done with the cleanups in swap_range_free(), and use the proper memory barriers to enforce it. This makes sure that code following try_to_unuse() can safely assume that swap_range_free() ran for all entries in thr swapfile (e.g. swap cache cleanup, zswap_swapoff()). In practice, this currently isn't a problem because swap_range_free() is called with the swap info lock held, and the swapoff code happens to spin for that after try_to_unuse(). However, this seems fragile and unintentional, so make it more relable and future-proof. This also facilitates a following simplification of zswap_swapoff(). Link: https://lkml.kernel.org/r/20240124045113.415378-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20240124045113.415378-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>