path: root/include
2022-12-01  rxrpc: trace: Don't use __builtin_return_address for rxrpc_peer tracing  (David Howells)
In rxrpc tracing, use enums to generate lists of points of interest rather than __builtin_return_address() for the rxrpc_peer tracepoint.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
2022-12-01  rxrpc: trace: Don't use __builtin_return_address for rxrpc_local tracing  (David Howells)
In rxrpc tracing, use enums to generate lists of points of interest rather than __builtin_return_address() for the rxrpc_local tracepoint.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
2022-12-01  rxrpc: Remove the [k_]proto() debugging macros  (David Howells)
Remove the kproto() and _proto() debugging macros in favour of using tracepoints for this.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
2022-12-01  rxrpc: Remove decl for rxrpc_kernel_call_is_complete()  (David Howells)
rxrpc_kernel_call_is_complete() has been removed, so remove its declaration too.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
2022-12-01  rxrpc: Implement an in-kernel rxperf server for testing purposes  (David Howells)
Implement an in-kernel rxperf server to allow kernel-based rxrpc services to be tested directly, unlike with AFS where they're accessed by the fileserver when the latter decides it wants to.

This is implemented as a module that, if loaded, opens UDP port 7009 (afs3-rmtsys) and listens on it for incoming calls. Calls can be generated using the rxperf command shipped with OpenAFS, for example.

Changes
=======
ver #2)
 - Use min_t() instead of min().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: Jakub Kicinski <kuba@kernel.org>
2022-12-01  wifi: cfg80211: Correct example of ieee80211_iface_limit  (Philipp Hortmann)
Correct wrong closing bracket.

Signed-off-by: Philipp Hortmann <philipp.g.hortmann@gmail.com>
Link: https://lore.kernel.org/r/20221114200135.GA100176@matrix-ESPRIMO-P710
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-12-01  wifi: ieee80211: Do not open-code qos address offsets  (Kees Cook)
When building with -Wstringop-overflow, GCC's KASAN implementation does not correctly perform bounds checking within some complex structures when faced with literal offsets, and can get very confused. For example, this warning is seen due to literal offsets into struct ieee80211_hdr that may or may not be large enough:

  drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c: In function 'iwl_mvm_rx_mpdu_mq':
  drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c:2022:29: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
   2022 |   *qc &= ~IEEE80211_QOS_CTL_A_MSDU_PRESENT;
  In file included from drivers/net/wireless/intel/iwlwifi/mvm/fw-api.h:32,
                   from drivers/net/wireless/intel/iwlwifi/mvm/sta.h:15,
                   from drivers/net/wireless/intel/iwlwifi/mvm/mvm.h:27,
                   from drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c:10:
  drivers/net/wireless/intel/iwlwifi/mvm/../fw/api/rx.h:559:16: note: at offset [78, 166] into destination object 'mpdu_len' of size 2
    559 |  __le16 mpdu_len;
        |         ^~~~~~~~

Refactor ieee80211_get_qos_ctl() to avoid using literal offsets, requiring the creation of the actual structure that is described in the comments. Explicitly choose the desired offset, making the code more human-readable too. This is one of the last remaining warnings to fix before enabling -Wstringop-overflow globally.

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97490
Link: https://github.com/KSPP/linux/issues/181
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: Gregory Greenman <gregory.greenman@intel.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221130212641.never.627-kees@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
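The pattern the change above moves to can be sketched in a few lines. This is an illustrative userspace sketch only: the structure layouts and names (qos_hdr_3addr, qos_hdr_4addr, qos_ctl_*) are invented here and are not the ones added to ieee80211.h; the point is that deriving the QoS control offset from a real type, rather than from a bare literal, gives the compiler an object whose size it can reason about.

---------------->8----------------
/* Illustrative sketch, not the mainline ieee80211.h layout. */
#include <stddef.h>
#include <stdint.h>

struct qos_hdr_3addr {               /* hypothetical 3-address QoS header */
	uint16_t frame_control;
	uint16_t duration_id;
	uint8_t  addr1[6];
	uint8_t  addr2[6];
	uint8_t  addr3[6];
	uint16_t seq_ctrl;
	uint16_t qos_ctrl;           /* sits at offset 24 */
} __attribute__((packed));

struct qos_hdr_4addr {               /* hypothetical 4-address QoS header */
	uint16_t frame_control;
	uint16_t duration_id;
	uint8_t  addr1[6];
	uint8_t  addr2[6];
	uint8_t  addr3[6];
	uint16_t seq_ctrl;
	uint8_t  addr4[6];
	uint16_t qos_ctrl;           /* sits at offset 30 */
} __attribute__((packed));

/* Old style: literal offsets that the compiler cannot bounds-check well. */
static uint8_t *qos_ctl_literal(void *hdr, int has_addr4)
{
	return (uint8_t *)hdr + (has_addr4 ? 30 : 24);
}

/* New style: explicit member access; the offset comes from a type. */
static uint8_t *qos_ctl_typed(void *hdr, int has_addr4)
{
	if (has_addr4)
		return (uint8_t *)&((struct qos_hdr_4addr *)hdr)->qos_ctrl;
	return (uint8_t *)&((struct qos_hdr_3addr *)hdr)->qos_ctrl;
}
---------------->8----------------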
2022-12-01  inet: ping: use hlist_nulls rcu iterator during lookup  (Florian Westphal)
ping_lookup() does not acquire the table spinlock, so iteration should use hlist_nulls_for_each_entry_rcu(). Spotted during code review.

Fixes: dbca1596bbb0 ("ping: convert to RCU lookups, get rid of rwlock")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20221129140644.28525-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
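For context, this is a minimal sketch of the lockless-reader pattern the fix switches to; the structure and field names (demo_sock, demo_lookup) are invented for illustration and only the iterator macro is the real kernel API. A reader that does not hold the table spinlock must use the _rcu iterator so traversal stays safe against concurrent writers.

---------------->8----------------
#include <linux/rculist_nulls.h>
#include <linux/rcupdate.h>

struct demo_sock {                     /* hypothetical hash-table entry */
	struct hlist_nulls_node node;
	unsigned short          port;
};

static struct demo_sock *demo_lookup(struct hlist_nulls_head *head,
				     unsigned short port)
{
	struct demo_sock *sk, *found = NULL;
	struct hlist_nulls_node *pos;

	rcu_read_lock();
	hlist_nulls_for_each_entry_rcu(sk, pos, head, node) {
		if (sk->port == port) {
			found = sk;
			break;
		}
	}
	rcu_read_unlock();
	/*
	 * A production lookup would typically take a reference on the
	 * entry (or stay inside the RCU read side) before using it, and
	 * may re-check keys after that.
	 */
	return found;
}
---------------->8----------------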
2022-12-01  vdso/timens: Refactor copy-pasted find_timens_vvar_page() helper into one copy  (Jann Horn)
find_timens_vvar_page() is not architecture-specific, as can be seen from how all five per-architecture versions of it are the same. (arm64, powerpc and riscv are exactly the same; x86 and s390 have two characters difference inside a comment, less blank lines, and mark the !CONFIG_TIME_NS version as inline.)

Refactor the five copies into a central copy in kernel/time/namespace.c.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130115320.2918447-1-jannh@google.com
2022-11-30  Merge tag 'mlx5-updates-2022-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux  (Jakub Kicinski)
Saeed Mahameed says:

====================
mlx5-updates-2022-11-29

Misc update for mlx5 driver

1) Various trivial cleanups
2) Maor Dickman, Adds support for trap offload with additional actions
3) From Tariq, UMR (device memory registrations) cleanups, UMR WQE must be aligned to 64B per device spec, (not a bug fix).

* tag 'mlx5-updates-2022-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5e: Support devlink reload of IPsec core
  net/mlx5e: TC, Add offload support for trap with additional actions
  net/mlx5e: Do early return when setup vports dests for slow path flow
  net/mlx5: Remove redundant check
  net/mlx5e: Delete always true DMA check
  net/mlx5e: Don't access directly DMA device pointer
  net/mlx5e: Don't use termination table when redundant
  net/mlx5: Fix orthography errors in documentation
  net/mlx5: Use generic definition for UMR KLM alignment
  net/mlx5: Generalize name of UMR alignment definition
  net/mlx5: Remove unused UMR MTT definitions
  net/mlx5e: Add padding when needed in UMR WQEs
  net/mlx5: Remove unused ctx variables
  net/mlx5e: Replace zero-length arrays with DECLARE_FLEX_ARRAY() helper
  net/mlx5e: Remove unneeded io-mapping.h #include
====================

Link: https://lore.kernel.org/r/20221130051152.479480-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-30  net: phy: Add link between phy dev and mac dev  (Xiaolei Wang)
If the external phy used by the current mac interface is managed by another mac interface, this network port cannot work independently; in particular, when the system suspends and resumes, the following trace may appear. So we should create a device link between the phy dev and the mac dev.

  WARNING: CPU: 0 PID: 24 at drivers/net/phy/phy.c:983 phy_error+0x20/0x68
  Modules linked in:
  CPU: 0 PID: 24 Comm: kworker/0:2 Not tainted 6.1.0-rc3-00011-g5aaef24b5c6d-dirty #34
  Hardware name: Freescale i.MX6 SoloX (Device Tree)
  Workqueue: events_power_efficient phy_state_machine
   unwind_backtrace from show_stack+0x10/0x14
   show_stack from dump_stack_lvl+0x68/0x90
   dump_stack_lvl from __warn+0xb4/0x24c
   __warn from warn_slowpath_fmt+0x5c/0xd8
   warn_slowpath_fmt from phy_error+0x20/0x68
   phy_error from phy_state_machine+0x22c/0x23c
   phy_state_machine from process_one_work+0x288/0x744
   process_one_work from worker_thread+0x3c/0x500
   worker_thread from kthread+0xf0/0x114
   kthread from ret_from_fork+0x14/0x28
  Exception stack(0xf0951fb0 to 0xf0951ff8)

Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20221130021216.1052230-1-xiaolei.wang@windriver.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
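As a rough sketch of the mechanism involved (not the actual hunk from this patch), a device link makes the MAC device a consumer of the PHY device so the driver core orders suspend/resume between them. The helper name (demo_link_phy_to_mac), the flag choice and the error code are illustrative assumptions.

---------------->8----------------
#include <linux/device.h>
#include <linux/errno.h>
#include <linux/phy.h>

/* Sketch: tie the MAC's suspend/resume ordering to the PHY's device. */
static int demo_link_phy_to_mac(struct device *mac_dev,
				struct phy_device *phydev)
{
	struct device_link *link;

	link = device_link_add(mac_dev, &phydev->mdio.dev,
			       DL_FLAG_AUTOREMOVE_CONSUMER);
	if (!link)
		return -EINVAL;	/* arbitrary error code for the sketch */
	return 0;
}
---------------->8----------------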
2022-11-30  net: devlink: let the core report the driver name instead of the drivers  (Vincent Mailhol)
The driver name is available in device_driver::name. Right now, drivers still have to report this piece of information themselves in their devlink_ops::info_get callback function. In order to factorize code, make devlink_nl_info_fill() add the driver name attribute.

Now that the core sets the driver name attribute, drivers are not supposed to call devlink_info_driver_name_put() anymore. Remove devlink_info_driver_name_put() and clean-up all the drivers using this function in their callback.

Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Tested-by: Ido Schimmel <idosch@nvidia.com> # mlxsw
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-30  devlink: support directly reading from region memory  (Jacob Keller)
To read from a region, user space must currently request a new snapshot of the region and then read from that snapshot. This can sometimes be overkill if user space only reads a tiny portion. They first create the snapshot, then request a read, then destroy the snapshot.

For regions which have a single underlying "contents", it makes sense to allow supporting direct reading of the region data.

Extend the DEVLINK_CMD_REGION_READ to allow direct reading from a region if requested via the new DEVLINK_ATTR_REGION_DIRECT. If this attribute is set, then perform a direct read instead of using a snapshot. Direct read is mutually exclusive with DEVLINK_ATTR_REGION_SNAPSHOT_ID, and care is taken to ensure that we reject commands which provide incorrect attributes.

Regions must enable support for direct read by implementing the .read() callback function. If a region does not support such direct reads, a suitable extended error message is reported.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-30  bpf: Fix a compilation failure with clang lto build  (Yonghong Song)
When building the kernel with clang lto (CONFIG_LTO_CLANG_FULL=y), the following compilation error will appear:

  $ make LLVM=1 LLVM_IAS=1 -j
  ...
  ld.lld: error: ld-temp.o <inline asm>:26889:1: symbol 'cgroup_storage_map_btf_ids' is already defined
  cgroup_storage_map_btf_ids:;
  ^
  make[1]: *** [/.../bpf-next/scripts/Makefile.vmlinux_o:61: vmlinux.o] Error 1

In local_storage.c, we have

  BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map)

Commit c4bcfb38a95e ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs") added the above identical BTF_ID_LIST_SINGLE definition in bpf_cgrp_storage.c. With duplicated definitions, the llvm linker complains with an lto build.

Also, the btf_id of 'struct bpf_local_storage_map' is extracted four times, for the sk, inode, task and cgrp local storages. Let us define a single global one with a different name than cgroup_storage_map_btf_ids, which also fixes the lto compilation error.

Fixes: c4bcfb38a95e ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221130052147.1591625-1-yhs@fb.com
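The shape of the clash and of the fix can be sketched as follows. The list name used after the fix (bpf_local_storage_map_btf_id) is a hypothetical placeholder here, not necessarily the name the patch picked; only the BTF_ID_LIST_SINGLE() macro itself is the real API.

---------------->8----------------
#include <linux/btf_ids.h>

/*
 * Before: two translation units each emit the same global asm symbol.
 * Under full LTO they end up in one object and the linker rejects the
 * duplicate definition.
 *
 *   local_storage.c:
 *     BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map)
 *   bpf_cgrp_storage.c:
 *     BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map)
 *
 * After (sketch): one list, defined once under a generic name, referenced
 * by the sk/inode/task/cgrp local-storage files instead of each defining
 * its own copy.
 */
BTF_ID_LIST_SINGLE(bpf_local_storage_map_btf_id, struct, bpf_local_storage_map)
---------------->8----------------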
2022-11-30  cxl/pmem: Introduce nvdimm_security_ops with ->get_flags() operation  (Dave Jiang)
Add nvdimm_security_ops support for CXL memory device with the introduction of the ->get_flags() callback function. This is part of the "Persistent Memory Data-at-rest Security" command set for CXL memory device support. The ->get_flags() function provides the security state of the persistent memory device defined by the CXL 3.0 spec section 8.2.9.8.6.1.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/166983609611.2734609.13231854299523325319.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-11-30  linux/init.h: include <linux/build_bug.h> and <linux/stringify.h>  (Masahiro Yamada)
With CONFIG_HAVE_ARCH_PREL32_RELOCATIONS=y, the following code fails to build:

---------------->8----------------
#include <linux/init.h>

int foo(void) { return 0; }
core_initcall(foo);
---------------->8----------------

Include <linux/build_bug.h> for static_assert() and <linux/stringify.h> for __stringify().

Link: https://lkml.kernel.org/r/20221113110802.3760705-1-masahiroy@kernel.org
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Cc: Jiangshan Yi <yijiangshan@kylinos.cn>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org> # build-tested
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
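The underlying dependency is generic: a macro that expands to static_assert() and __stringify() only works in translation units that (transitively) pulled in build_bug.h and stringify.h, so the header defining the macro should include them itself. The following is a minimal, made-up illustration of such a macro, not the real initcall machinery.

---------------->8----------------
#include <linux/build_bug.h>
#include <linux/stringify.h>

/* Hypothetical macro: needs static_assert() and __stringify() to expand. */
#define demo_register(fn)						\
	static_assert(__builtin_types_compatible_p(typeof(&fn),	\
						    int (*)(void)),	\
		      "demo_register() expects int (*)(void)");		\
	static const char __demo_name_##fn[] __attribute__((unused)) =	\
		__stringify(fn)

int foo(void) { return 0; }
demo_register(foo);
---------------->8----------------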
2022-11-30  debugfs: fix error when writing negative value to atomic_t debugfs file  (Akinobu Mita)
The simple attribute files do not accept a negative value since the commit 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()"), so we have to use a 64-bit value to write a negative value for a debugfs file created by debugfs_create_atomic_t().

This restores the previous behaviour by introducing DEFINE_DEBUGFS_ATTRIBUTE_SIGNED for a signed value.

Link: https://lkml.kernel.org/r/20220919172418.45257-4-akinobu.mita@gmail.com
Fixes: 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  libfs: add DEFINE_SIMPLE_ATTRIBUTE_SIGNED for signed value  (Akinobu Mita)
Patch series "fix error when writing negative value to simple attribute files".

The simple attribute files do not accept a negative value since the commit 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()"), but some attribute files want to accept a negative value.

This patch (of 3):

The simple attribute files do not accept a negative value since the commit 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()"), so we have to use a 64-bit value to write a negative value.

This adds DEFINE_SIMPLE_ATTRIBUTE_SIGNED for a signed value.

Link: https://lkml.kernel.org/r/20220919172418.45257-1-akinobu.mita@gmail.com
Link: https://lkml.kernel.org/r/20220919172418.45257-2-akinobu.mita@gmail.com
Fixes: 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
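A hedged usage sketch of the new macro: the attribute names and getter/setter below (demo_*) are hypothetical, and the assumption is that DEFINE_SIMPLE_ATTRIBUTE_SIGNED takes the same arguments as DEFINE_SIMPLE_ATTRIBUTE() while parsing "-123"-style input as a signed number.

---------------->8----------------
#include <linux/fs.h>

static s64 demo_value;

static int demo_get(void *data, u64 *val)
{
	*val = demo_value;
	return 0;
}

static int demo_set(void *data, u64 val)
{
	demo_value = val;	/* u64 carries the (signed) user value */
	return 0;
}

/* Format string prints the value as signed. */
DEFINE_SIMPLE_ATTRIBUTE_SIGNED(demo_fops, demo_get, demo_set, "%lld\n");
---------------->8----------------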
2022-12-01  bpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytes  (Pengcheng Yang)
When redirecting, we use sk_msg_to_ingress() to get the BPF_F_INGRESS flag from msg->flags. If apply_bytes is used and it is larger than the current data being processed, sk_psock_msg_verdict() will not be called when sendmsg() is called again. At this time, msg->flags is 0, and we have lost the BPF_F_INGRESS flag.

So we need to save the BPF_F_INGRESS flag in sk_psock and use it when redirecting.

Fixes: 8934ce2fd081 ("bpf: sockmap redirect ingress support")
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/1669718441-2654-3-git-send-email-yangpc@wangsu.com
2022-11-30  s390/mm: use pmd_pgtable_page() helper in __gmap_segment_gaddr()  (Anshuman Khandual)
In __gmap_segment_gaddr(), the pmd-level page table page is being extracted from the pmd pointer, similar to the pmd_pgtable_page() implementation. This reduces some redundancy by directly using pmd_pgtable_page() instead, after first making it available.

Link: https://lkml.kernel.org/r/20221125034502.1559986-1-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm/thp: rename pmd_to_page() as pmd_pgtable_page()  (Anshuman Khandual)
The current pmd_to_page(), which derives the page table page containing the pmd address, has a very misleading name. The problem is that it sounds similar to pmd_page(), which derives the page embedded in a given pmd entry, either for the next-level page table or for a mapped huge page. Rename it as pmd_pgtable_page() instead.

Link: https://lkml.kernel.org/r/20221124131641.1523772-1-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm: add bdi_set_min_ratio_no_scale() function  (Stefan Roesch)
This introduces bdi_set_min_ratio_no_scale(). It uses the max granularity for the ratio. This function is used by the new sysfs knob min_ratio_fine.

Link: https://lkml.kernel.org/r/20221119005215.3052436-19-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Chris Mason <clm@meta.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm: add bdi_set_max_ratio_no_scale() function  (Stefan Roesch)
This introduces bdi_set_max_ratio_no_scale(). It uses the max granularity for the ratio. This function is used by the new sysfs knob max_ratio_fine.

Link: https://lkml.kernel.org/r/20221119005215.3052436-16-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Chris Mason <clm@meta.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm: add bdi_set_min_bytes() function  (Stefan Roesch)
This introduces the bdi_set_min_bytes() function. The min_bytes function does not store the min_bytes value. Instead it converts the min_bytes value into the corresponding ratio value.

Link: https://lkml.kernel.org/r/20221119005215.3052436-13-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Chris Mason <clm@meta.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm: add bdi_get_min_bytes() function  (Stefan Roesch)
This adds a function to return the specified value for min_bytes. It converts the stored min_ratio of the bdi to the corresponding bytes value. This is an approximation as it is based on the value that is returned by global_dirty_limits(), which can change. The returned value can be different than the value when the min_bytes value was set.

Link: https://lkml.kernel.org/r/20221119005215.3052436-11-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Chris Mason <clm@meta.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm: add bdi_set_max_bytes() function  (Stefan Roesch)
This introduces the bdi_set_max_bytes() function. The max_bytes function does not store the max_bytes value. Instead it converts the max_bytes value into the corresponding ratio value.

Link: https://lkml.kernel.org/r/20221119005215.3052436-8-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Chris Mason <clm@meta.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm: add bdi_get_max_bytes() function  (Stefan Roesch)
This adds a function to return the specified value for max_bytes. It converts the stored max_ratio of the bdi to the corresponding bytes value. It introduces the bdi_get_bytes helper function to do the conversion. This is an approximation as it is based on the value that is returned by global_dirty_limits(), which can change. The helper function will also be used by the min_bytes bdi knob.

Link: https://lkml.kernel.org/r/20221119005215.3052436-6-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Chris Mason <clm@meta.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm: use part per 1000000 for bdi ratios  (Stefan Roesch)
To get finer granularity for ratio calculations, use parts per million instead of percentages. This is especially important if we want to automatically convert byte values to ratios. Otherwise the values that are actually used can be quite different. This is also important for machines with more main memory (1% of 256GB is already 2.5GB).

Link: https://lkml.kernel.org/r/20221119005215.3052436-5-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Chris Mason <clm@meta.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
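The granularity argument can be checked with a little arithmetic; the constants and helper name below are made up for illustration, only the numbers matter. On a 256 GiB machine, percent steps are about 2.56 GiB each, while parts-per-million steps are about 256 KiB each, which is what makes byte-to-ratio conversion usable.

---------------->8----------------
#include <stdio.h>

#define RATIO_PART_PER_1M 1000000ULL	/* illustrative constant */

/* Convert a byte limit into a part-per-million ratio of total memory. */
static unsigned long long bytes_to_ratio(unsigned long long bytes,
					 unsigned long long total)
{
	return bytes * RATIO_PART_PER_1M / total;
}

int main(void)
{
	unsigned long long total = 256ULL << 30;	/* 256 GiB */

	printf("1%% of total    = %llu bytes\n", total / 100);
	printf("1 ppm of total = %llu bytes\n", total / RATIO_PART_PER_1M);
	printf("512 MiB as ppm = %llu\n",
	       bytes_to_ratio(512ULL << 20, total));	/* ~1953 */
	return 0;
}
---------------->8----------------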
2022-11-30  mm: add bdi_set_strict_limit() function  (Stefan Roesch)
Patch series "mm/block: add bdi sysfs knobs", v4.

At Meta, network block devices (nbd) are used to implement remote block storage. In testing and during production it has been observed that these network block devices can consume a huge portion of the dirty writeback cache and writeback can take a considerable time.

To be able to give stricter limits, I'm proposing the following changes:

1) Introduce a strictlimit knob

  Currently the max_ratio knob exists to limit the dirty_memory. However this knob only applies once (dirty_ratio + dirty_background_ratio) / 2 has been reached. With the BDI_CAP_STRICTLIMIT flag, the max_ratio can be applied without reaching that limit. This change exposes that knob. This knob can also be useful for NFS, fuse filesystems and USB devices.

2) Use parts per 1000000 for internal calculations

  The max_ratio is based on percentage. With the current machine sizes percentage values can be very high (1% of a 256GB main memory is already 2.5GB). This change uses parts per 1000000 instead of percentages for the internal calculations.

3) Introduce two new sysfs knobs: min_bytes and max_bytes

  Currently all calculations are based on ratio, but for a user it is often more convenient to specify a limit in bytes. The new knobs will not store bytes values; instead they will translate the byte value to a corresponding ratio. As the internal values are now parts per 1000000, the ratio is closer to the specified value. However the value should be seen more as an approximation, as it can fluctuate over time.

4) Introduce two new sysfs knobs: min_ratio_fine and max_ratio_fine

  The granularity for the existing sysfs bdi knobs min_ratio and max_ratio is based on percentage values. The new sysfs bdi knobs min_ratio_fine and max_ratio_fine allow specifying the ratio as parts per million.

This patch (of 20):

This adds the bdi_set_strict_limit function to be able to set/unset the BDI_CAP_STRICTLIMIT flag.

Link: https://lkml.kernel.org/r/20221119005215.3052436-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20221119005215.3052436-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Chris Mason <clm@meta.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  folio-compat: remove try_to_release_page()  (Vishal Moola (Oracle))
There are no more callers of try_to_release_page(), so remove it. This saves 85 bytes of kernel text.

Link: https://lkml.kernel.org/r/20221118073055.55694-5-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm/gup: reliable R/O long-term pinning in COW mappings  (David Hildenbrand)
We already support reliable R/O pinning of anonymous memory. However, assume we end up pinning (R/O long-term) a pagecache page or the shared zeropage inside a writable private ("COW") mapping. The next write access will trigger a write-fault and replace the pinned page by an exclusive anonymous page in the process page tables to break COW: the pinned page no longer corresponds to the page mapped into the process' page table.

Now that FAULT_FLAG_UNSHARE can break COW on anything mapped into a COW mapping, let's properly break COW first before R/O long-term pinning something that's not an exclusive anon page inside a COW mapping. FAULT_FLAG_UNSHARE will break COW and map an exclusive anon page instead that can get pinned safely.

With this change, we can stop using FOLL_FORCE|FOLL_WRITE for reliable R/O long-term pinning in COW mappings.

With this change, the new R/O long-term pinning tests for non-anonymous memory succeed:

  # [RUN] R/O longterm GUP pin ... with shared zeropage
  ok 151 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP pin ... with memfd
  ok 152 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP pin ... with tmpfile
  ok 153 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP pin ... with huge zeropage
  ok 154 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB)
  ok 155 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB)
  ok 156 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP-fast pin ... with shared zeropage
  ok 157 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP-fast pin ... with memfd
  ok 158 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP-fast pin ... with tmpfile
  ok 159 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP-fast pin ... with huge zeropage
  ok 160 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB)
  ok 161 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB)
  ok 162 Longterm R/O pin is reliable

Note 1: We don't care about short-term R/O-pinning, because they have snapshot semantics: they are not supposed to observe modifications that happen after pinning. As one example, assume we start direct I/O to read from a page and store page content into a file: modifications to page content after starting direct I/O are not guaranteed to end up in the file. So even if we'd pin the shared zeropage, the end result would be as expected -- getting zeroes stored to the file.

Note 2: For shared mappings we'll now always fallback to the slow path to lookup the VMA when R/O long-term pinning. While that's the necessary price we have to pay right now, it's actually not that bad in practice: most FOLL_LONGTERM users already specify FOLL_WRITE, for example, along with FOLL_FORCE because they tried dealing with COW mappings correctly ...

Note 3: For users that use FOLL_LONGTERM right now without FOLL_WRITE, such as VFIO, we'd now no longer pin the shared zeropage. Instead, we'd populate exclusive anon pages that we can pin. There was a concern that this could affect the memlock limit of existing setups. For example, a VM running with VFIO could run into the memlock limit and fail to run. However, we essentially had the same behavior already in commit 17839856fd58 ("gup: document and work around "COW can break either way" issue") which got merged into some enterprise distros, and there were not any such complaints. So most probably, we're fine.

Link: https://lkml.kernel.org/r/20221116102659.70287-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping  (David Hildenbrand)
Extend FAULT_FLAG_UNSHARE to break COW on anything mapped into a COW (i.e., private writable) mapping and adjust the documentation accordingly.

FAULT_FLAG_UNSHARE will now also break COW when encountering the shared zeropage, a pagecache page, a PFNMAP, ... inside a COW mapping, by properly replacing the mapped page/pfn by a private copy (an exclusive anonymous page).

Note that only do_wp_page() needs care: hugetlb_wp() already handles FAULT_FLAG_UNSHARE correctly. wp_huge_pmd()/wp_huge_pud() also handle it correctly, for example, splitting the huge zeropage on FAULT_FLAG_UNSHARE such that we can handle FAULT_FLAG_UNSHARE on the PTE level.

This change is a requirement for reliable long-term R/O pinning in COW mappings.

Link: https://lkml.kernel.org/r/20221116102659.70287-9-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm: anonymous shared memory naming  (Pasha Tatashin)
Since commit 9a10064f5625 ("mm: add a field to store names for private anonymous memory"), a name can be set for private anonymous memory, but not for shared anonymous memory. However, naming shared anonymous memory is just as useful for tracking purposes.

Extend the functionality to be able to set names for shared anon.

There are two ways to create anonymous shared memory, using memfd or directly via mmap():

  1. fd = memfd_create(...)
     mem = mmap(..., MAP_SHARED, fd, ...)
  2. mem = mmap(..., MAP_SHARED | MAP_ANONYMOUS, -1, ...)

In both cases the anonymous shared memory is created the same way by mapping an unlinked file on tmpfs. The memfd way allows giving a name to anonymous shared memory, but is not useful when parts of shared memory require distinct names.

Example use case: the VMM maps VM memory as anonymous shared memory (not private, because the VMM is sandboxed and drivers are running in their own processes). However, the VM tells back to the VMM how parts of the memory are actually used by the guest, how each of the segments should be backed (i.e. 4K pages, 2M pages), and some other information about the segments. The naming allows us to monitor the effective memory footprint for each of these segments from the host without looking inside the guest.

Sample output:

  /* Create shared anonymous segment */
  anon_shmem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  /* Name the segment: "MY-NAME" */
  rv = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, anon_shmem, SIZE, "MY-NAME");

  cat /proc/<pid>/maps (and smaps):
  7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 [anon_shmem:MY-NAME]

  If the segment is not named, the output is:
  7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 /dev/zero (deleted)

Link: https://lkml.kernel.org/r/20221115020602.804224-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Colin Cross <ccross@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: xu xin <cgel.zte@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm: shrinkers: add missing includes for undeclared types  (T.J. Mercier)
The shrinker.h header depends on a user including other headers before it for types used by shrinker.h. Fix this by including the appropriate headers in shrinker.h.

  ./include/linux/shrinker.h:13:9: error: unknown type name `gfp_t'
     13 |         gfp_t gfp_mask;
        |         ^~~~~
  ./include/linux/shrinker.h:71:26: error: field `list' has incomplete type
     71 |         struct list_head list;
        |                          ^~~~
  ./include/linux/shrinker.h:82:9: error: unknown type name `atomic_long_t'
     82 |         atomic_long_t *nr_deferred;

Link: https://lkml.kernel.org/r/20221114235949.201749-1-tjmercier@google.com
Fixes: 83aeeada7c69 ("vmscan: use atomic-long for shrinker batching")
Fixes: b0d40c92adaf ("superblock: introduce per-sb cache shrinker infrastructure")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  zram: add size class equals check into recompression  (Alexey Romanov)
It makes no sense for us to recompress the object if it will be in the same size class. We anyway don't get any memory gain. But, at the same time, we get a CPU time overhead when inserting this object into zspage and decompressing it afterwards.

[senozhatsky: rebased and fixed conflicts]
Link: https://lkml.kernel.org/r/20221109115047.2921851-9-senozhatsky@chromium.org
Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru>
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm: delay page_remove_rmap() until after the TLB has been flushed  (Linus Torvalds)
When we remove a page table entry, we are very careful to only free the page after we have flushed the TLB, because other CPUs could still be using the page through stale TLB entries until after the flush.

However, we have removed the rmap entry for that page early, which means that functions like folio_mkclean() would end up not serializing with the page table lock because the page had already been made invisible to rmap.

And that is a problem, because while the TLB entry exists, we could end up with the following situation:

 (a) one CPU could come in and clean it, never seeing our mapping of the page

 (b) another CPU could continue to use the stale and dirty TLB entry and continue to write to said page

resulting in a page that has been dirtied, but then marked clean again, all while another CPU might have dirtied it some more.

End result: possibly lost dirty data.

This extends our current TLB gather infrastructure to optionally track a "should I do a delayed page_remove_rmap() for this page after flushing the TLB". It uses the newly introduced 'encoded page pointer' to do that without having to keep separate data around.

Note, this is complicated by a couple of issues:

 - we want to delay the rmap removal, but not past the page table lock, because that simplifies the memcg accounting

 - only SMP configurations want to delay TLB flushing, since on UP there are obviously no remote TLBs to worry about, and the page table lock means there are no preemption issues either

 - s390 has its own mmu_gather model that doesn't delay TLB flushing, and as a result also does not want the delayed rmap. As such, we can treat S390 like the UP case and use a common fallback for the "no delays" case.

 - we can track an enormous number of pages in our mmu_gather structure, with MAX_GATHER_BATCH_COUNT batches of MAX_TABLE_BATCH pages each, all set up to be approximately 10k pending pages. We do not want to have a huge number of batched pages that we then need to check for delayed rmap handling inside the page table lock.

Particularly that last point results in a noteworthy detail, where the normal page batch gathering is limited once we have delayed rmaps pending, in such a way that only the last batch (the so-called "active batch") in the mmu_gather structure can have any delayed entries.

NOTE! While the "possibly lost dirty data" sounds catastrophic, for this all to happen you need to have a user thread doing either madvise() with MADV_DONTNEED or a full re-mmap() of the area concurrently with another thread continuing to use said mapping.

So arguably this is about user space doing crazy things, but from a VM consistency standpoint it's better if we track the dirty bit properly even when user space goes off the rails.

[akpm@linux-foundation.org: fix UP build, per Linus]
Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
Link: https://lkml.kernel.org/r/20221109203051.1835763-4-torvalds@linux-foundation.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm: mmu_gather: prepare to gather encoded page pointers with flags  (Linus Torvalds)
This is purely a preparatory patch that makes all the data structures ready for encoding flags with the mmu_gather page pointers.

The code currently always sets the flag to zero and doesn't use it yet, but now it's tracking the type state along. The next step will be to actually start using it.

Link: https://lkml.kernel.org/r/20221109203051.1835763-3-torvalds@linux-foundation.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm: teach release_pages() to take an array of encoded page pointers too  (Linus Torvalds)
release_pages() already could take either an array of page pointers, or an array of folio pointers. Expand it to also accept an array of encoded page pointers, which is what both the existing mlock() use and the upcoming mmu_gather use of encoded page pointers wants.

Note that release_pages() won't actually use, or react to, any extra encoded bits. Instead, this is very much a case of "I have walked the array of encoded pages and done everything the extra bits tell me to do, now release it all".

Also, while the "either page or folio pointers" dual use was handled with a cast of the pointer in "release_folios()", this takes a slightly different approach and uses the "transparent union" attribute to describe the set of arguments to the function:

  https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html

which has been supported by gcc forever, but the kernel hasn't used before. That allows us to avoid using various wrappers with casts, and just use the same function regardless of use.

Link: https://lkml.kernel.org/r/20221109203051.1835763-2-torvalds@linux-foundation.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
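The transparent-union trick itself can be demonstrated in a standalone program; the types and function below (demo_release, release_arg_t) are illustrative and are not the kernel's release_pages() signature. Callers pass either pointer type directly and the compiler picks the matching union member, so no casts or wrapper functions are needed.

---------------->8----------------
#include <stdio.h>

struct page  { int id; };
struct folio { struct page page; };	/* a folio starts with a page */

typedef union {
	struct page  **pages;
	struct folio **folios;
} release_arg_t __attribute__((__transparent_union__));

static void demo_release(release_arg_t arg, int nr)
{
	/* Inside the callee, pick one view of the argument. */
	struct page **pages = arg.pages;

	for (int i = 0; i < nr; i++)
		printf("releasing page %d\n", pages[i]->id);
}

int main(void)
{
	struct page p0 = { .id = 0 }, p1 = { .id = 1 };
	struct folio f = { .page = { .id = 42 } };
	struct page *pages[] = { &p0, &p1 };
	struct folio *folios[] = { &f };

	demo_release(pages, 2);		/* passes struct page **  */
	demo_release(folios, 1);	/* passes struct folio ** */
	return 0;
}
---------------->8----------------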
2022-11-30  mm: introduce 'encoded' page pointers with embedded extra bits  (Linus Torvalds)
We already have this notion in parts of the MM code (see the mlock code with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new case, and I refuse to do the same thing we've done before where we just put bits in the raw pointer and say it's still a normal pointer.

So this introduces a 'struct encoded_page' pointer that cannot be used for anything else than to encode a real page pointer and a couple of extra bits in the low bits. That way the compiler can trivially track the state of the pointer and you just explicitly encode and decode the extra bits.

Note that this makes the alignment of 'struct page' explicit even for the case where CONFIG_HAVE_ALIGNED_STRUCT_PAGE is not set. That is entirely redundant in almost all cases, since the page structure already contains several word-sized entries.

However, on m68k, the alignment of even 32-bit data is just 16 bits, and as such in theory the alignment of 'struct page' could be too. So let's just make it very very explicit that the alignment needs to be at least 32 bits, giving us a guarantee of two unused low bits in the pointer.

Now, in practice, our page struct array is aligned much more than that anyway, even on m68k, and our existing code in mm/mlock.c obviously already depended on that. But since the whole point of this change is to be careful about the type system when hiding extra bits in the pointer, let's also be explicit about the assumptions we make.

NOTE! This is being very careful in another way too: it has a build-time assertion that the 'flags' added to the page pointer actually fit in the two bits. That means that this helper must be inlined, and can only be used in contexts where the compiler can statically determine that the value fits in the available bits.

[akpm@linux-foundation.org: kerneldoc on a forward-declared struct confuses htmldocs]
Link: https://lore.kernel.org/all/Y2tKixpO4RO6DgW5@tuxmaker.boeblingen.de.ibm.com/
Link: https://lkml.kernel.org/r/20221109203051.1835763-1-torvalds@linux-foundation.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com> [s390]
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
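The idea generalizes beyond struct page, and a userspace-style sketch makes the type-safety point concrete. The helper names below mirror the description above but are written here as illustrative assumptions, not the exact kernel helpers; the kernel also enforces the "flags fit in two bits" rule at build time rather than with a runtime assert as done here.

---------------->8----------------
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct page { int id; } __attribute__((aligned(4)));
struct encoded_page;			/* opaque: never dereferenced */

#define ENCODED_PAGE_BITS 3UL		/* mask for the two free low bits */

static inline struct encoded_page *encode_page(struct page *page,
					       unsigned long flags)
{
	_Static_assert(_Alignof(struct page) >= 4,
		       "need two free low bits in page pointers");
	assert((flags & ~ENCODED_PAGE_BITS) == 0);	/* kernel: build-time */
	return (struct encoded_page *)((uintptr_t)page | flags);
}

static inline unsigned long encoded_page_flags(struct encoded_page *ep)
{
	return (uintptr_t)ep & ENCODED_PAGE_BITS;
}

static inline struct page *encoded_page_ptr(struct encoded_page *ep)
{
	return (struct page *)((uintptr_t)ep & ~ENCODED_PAGE_BITS);
}

int main(void)
{
	struct page p = { .id = 7 };
	struct encoded_page *ep = encode_page(&p, 1);	/* flag bit 0 set */

	printf("page %d, flags %lu\n",
	       encoded_page_ptr(ep)->id, encoded_page_flags(ep));
	return 0;
}
---------------->8----------------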
2022-11-30  mm: remove unused savedwrite infrastructure  (David Hildenbrand)
NUMA hinting no longer uses savedwrite, let's rip it out.

... and while at it, drop __pte_write() and __pmd_write() on ppc64.

Link: https://lkml.kernel.org/r/20221108174652.198904-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite  (David Hildenbrand)
commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault") added remembering write permissions using ordinary pte_write() for PROT_NONE mapped pages to avoid write faults when remapping the page !PROT_NONE on NUMA hinting faults.

That commit noted:

    The patch looks hacky but the alternatives looked worse.

    The tidest was to rewalk the page tables after a hinting fault but it
    was more complex than this approach and the performance was worse.
    It's not generally safe to just mark the page writable during the
    fault if it's a write fault as it may have been read-only for COW so
    that approach was discarded.

Later, commit 288bc54949fc ("mm/autonuma: let architecture override how the write bit should be stashed in a protnone pte.") introduced a family of savedwrite PTE functions that didn't necessarily improve the whole situation.

One confusing thing is that nowadays, if a page is pte_protnone() and pte_savedwrite() then also pte_write() is true. Another source of confusion is that there is only a single pte_mk_savedwrite() call in the kernel. All other write-protection code seems to silently rely on pte_wrprotect().

Ever since PageAnonExclusive was introduced and we started using it in mprotect context via commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection"), we do have machinery in place to avoid write faults when changing protection, which is exactly what we want to do here.

Let's similarly do what ordinary mprotect() does nowadays when upgrading write permissions and reuse can_change_pte_writable() and can_change_pmd_writable() to detect if we can upgrade PTE permissions to be writable.

For anonymous pages there should be absolutely no change: if an anonymous page is not exclusive, it could not have been mapped writable -- because only exclusive anonymous pages can be mapped writable.

However, there *might* be a change for writable shared mappings that require writenotify: if they are not dirty, we cannot map them writable. While it might not matter in practice, we'd need a different way to identify whether writenotify is actually required -- and ordinary mprotect would benefit from that as well.

Note that we don't optimize for the actual migration case:

(1) When migration succeeds the new PTE will not be writable because the source PTE was not writable (protnone); in the future we might just optimize that case similarly by reusing can_change_pte_writable()/can_change_pmd_writable() when removing migration PTEs.

(2) When migration fails, we'd have to recalculate the "writable" flag because we temporarily dropped the PT lock; for now keep it simple and set "writable=false".

We'll remove all savedwrite leftovers next.

Link: https://lkml.kernel.org/r/20221108174652.198904-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm/mprotect: factor out check whether manual PTE write upgrades are required  (David Hildenbrand)
Let's factor the check out into vma_wants_manual_pte_write_upgrade(), to be reused in NUMA hinting fault context soon.

Link: https://lkml.kernel.org/r/20221108174652.198904-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped  (Hugh Dickins)
Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now? Yes. Not by atomic64_t or cmpxchg games, those get difficult on 32-bit; but if we slightly abuse subpages_mapcount by additionally demanding that one bit be set there when the compound page is PMD-mapped, then a cascade of two atomic ops is able to maintain the stats without bit_spin_lock.

This is harder to reason about than when bit_spin_locked, but I believe safe; and no drift in stats detected when testing. When there are racing removes and adds, of course the sequence of operations is less well-defined; but each operation on subpages_mapcount is atomically good. What might be disastrous, is if subpages_mapcount could ever fleetingly appear negative: but the pte lock (or pmd lock) these rmap functions are called under, ensures that a last remove cannot race ahead of a first add.

Continue to make an exception for hugetlb (PageHuge) pages, though that exception can be easily removed by a further commit if necessary: leave subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just carry on checking compound_mapcount too in folio_mapped(), page_mapped().

Evidence is that this way goes slightly faster than the previous implementation in all cases (pmds after ptes now taking around 103ms); and relieves us of worrying about contention on the bit_spin_lock.

Link: https://lkml.kernel.org/r/3978f3ca-5473-55a7-4e14-efea5968d892@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Carpenter <error27@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm,thp,rmap: subpages_mapcount of PTE-mapped subpages  (Hugh Dickins)
Patch series "mm,thp,rmap: rework the use of subpages_mapcount", v2.

This patch (of 3):

Following suggestion from Linus, instead of counting every PTE map of a compound page in subpages_mapcount, just count how many of its subpages are PTE-mapped: this yields the exact number needed for NR_ANON_MAPPED and NR_FILE_MAPPED stats, without any need for a locked scan of subpages; and requires updating the count less often.

This does then revert total_mapcount() and folio_mapcount() to needing a scan of subpages; but they are inherently racy, and need no locking, so Linus is right that the scans are much better done there. Plus (unlike in 6.1 and previous) subpages_mapcount lets us avoid the scan in the common case of no PTE maps. And page_mapped() and folio_mapped() remain scanless and just as efficient with the new meaning of subpages_mapcount: those are the functions which I most wanted to remove the scan from.

The updated page_dup_compound_rmap() is no longer suitable for use by anon THP's __split_huge_pmd_locked(); but page_add_anon_rmap() can be used for that, so long as its VM_BUG_ON_PAGE(!PageLocked) is deleted.

Evidence is that this way goes slightly faster than the previous implementation for most cases; but significantly faster in the (now scanless) pmds after ptes case, which started out at 870ms and was brought down to 495ms by the previous series, now takes around 105ms.

Link: https://lkml.kernel.org/r/a5849eca-22f1-3517-bf29-95d982242742@google.com
Link: https://lkml.kernel.org/r/eec17e16-4e1-7c59-f1bc-5bca90dac919@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Carpenter <error27@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts  (Hugh Dickins)
Fix the races in maintaining compound_mapcount, subpages_mapcount and subpage _mapcount by using PG_locked in the first tail of any compound page for a bit_spin_lock() on such modifications; skipping the usual atomic operations on those fields in this case.

Bring page_remove_file_rmap() and page_remove_anon_compound_rmap() back into page_remove_rmap() itself. Rearrange page_add_anon_rmap() and page_add_file_rmap() and page_remove_rmap() to follow the same "if (compound) {lock} else if (PageCompound) {lock} else {atomic}" pattern (with a PageTransHuge in the compound test, like before, to avoid BUG_ONs and optimize away that block when THP is not configured). Move all the stats updates outside, after the bit_spin_locked section, so that it is sure to be a leaf lock.

Add page_dup_compound_rmap() to manage compound locking versus atomics in sync with the rest. In particular, hugetlb pages are still using the atomics: to avoid unnecessary interference there, and because they never have subpage mappings; but this exception can easily be changed. Conveniently, page_dup_compound_rmap() turns out to suit an anon THP's __split_huge_pmd_locked() too.

bit_spin_lock() is not popular with PREEMPT_RT folks: but PREEMPT_RT sensibly excludes TRANSPARENT_HUGEPAGE already, so its only exposure is to the non-hugetlb non-THP pte-mapped compound pages (with large folios being currently dependent on TRANSPARENT_HUGEPAGE). There is never any scan of subpages in this case; but we have chosen to use PageCompound tests rather than PageTransCompound tests to gate the use of lock_compound_mapcounts(), so that page_mapped() is correct on all compound pages, whether or not TRANSPARENT_HUGEPAGE is enabled: could that be a problem for PREEMPT_RT, when there is contention on the lock - under heavy concurrent forking for example? If so, then it can be turned into a sleeping lock (like folio_lock()) when PREEMPT_RT.

A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took 18 seconds on small pages, and used to take 1 second on huge pages, but now takes 115 milliseconds on huge pages. Mapping by pmds a second time used to take 860ms and now takes 86ms; mapping by pmds after mapping by ptes (when the scan is needed) used to take 870ms and now takes 495ms. Mapping huge pages by ptes is largely unaffected but variable: between 5% faster and 5% slower in what I've recorded. Contention on the lock is likely to behave worse than contention on the atomics behaved.

Link: https://lkml.kernel.org/r/1b42bd1a-8223-e827-602f-d466c2db7d3c@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30  mm,thp,rmap: simplify compound page mapcount handling  (Hugh Dickins)
Compound page (folio) mapcount calculations have been different for anon and file (or shmem) THPs, and involved the obscure PageDoubleMap flag. And each huge mapping and unmapping of a file (or shmem) THP involved atomically incrementing and decrementing the mapcount of every subpage of that huge page, dirtying many struct page cachelines.

Add subpages_mapcount field to the struct folio and first tail page, so that the total of subpage mapcounts is available in one place near the head: then page_mapcount() and total_mapcount() and page_mapped(), and their folio equivalents, are so quick that anon and file and hugetlb don't need to be optimized differently. Delete the unloved PageDoubleMap.

page_add and page_remove rmap functions must now maintain the subpages_mapcount as well as the subpage _mapcount, when dealing with pte mappings of huge pages; and correct maintenance of NR_ANON_MAPPED and NR_FILE_MAPPED statistics still needs reading through the subpages, using nr_subpages_unmapped() - but only when first or last pmd mapping finds subpages_mapcount raised (double-map case, not the common case).

But are those counts (used to decide when to split an anon THP, and in vmscan's pagecache_reclaimable heuristic) correctly maintained? Not quite: since page_remove_rmap() (and also split_huge_pmd()) is often called without page lock, there can be races when a subpage pte mapcount 0<->1 while compound pmd mapcount 0<->1 is scanning - races which the previous implementation had prevented. The statistics might become inaccurate, and even drift down until they underflow through 0. That is not good enough, but is better dealt with in a followup patch.

Update a few comments on first and second tail page overlaid fields. hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but subpages_mapcount and compound_pincount are already correctly at 0, so delete its reinitialization of compound_pincount.

A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took 18 seconds on small pages, and used to take 1 second on huge pages, but now takes 119 milliseconds on huge pages. Mapping by pmds a second time used to take 860ms and now takes 92ms; mapping by pmds after mapping by ptes (when the scan is needed) used to take 870ms and now takes 495ms. But there might be some benchmarks which would show a slowdown, because tail struct pages now fall out of cache until final freeing checks them.

Link: https://lkml.kernel.org/r/47ad693-717-79c8-e1ba-46c3a6602e48@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm,hugetlb: use folio fields in second tail pageHugh Dickins
Patch series "mm,huge,rmap: unify and speed up compound mapcounts". This patch (of 3): We want to declare one more int in the first tail of a compound page: that first tail page being valuable property, since every compound page has a first tail, but perhaps no more than that. No problem on 64-bit: there is already space for it. No problem with 32-bit THPs: 5.18 commit 5232c63f46fd ("mm: Make compound_pincount always available") kindly cleared the space for it, apparently not realizing that only 64-bit architectures enable CONFIG_THP_SWAP (whose use of tail page->private might conflict) - but make sure of that in its Kconfig. But hugetlb pages use tail page->private of the first tail page for a subpool pointer, which will conflict; and they also use page->private of the 2nd, 3rd and 4th tails. Undo "mm: add private field of first tail to struct page and struct folio"'s recent addition of private_1 to the folio tail: instead add hugetlb_subpool, hugetlb_cgroup, hugetlb_cgroup_rsvd, hugetlb_hwpoison to a second tail page of the folio: THP has long been using several fields of that tail, so make better use of it for hugetlb too. This is not how a generic folio should be declared in future, but it is an effective transitional way to make use of it. Delete the SUBPAGE_INDEX stuff, but keep __NR_USED_SUBPAGE: now 3. [hughd@google.com: prefix folio's page_1 and page_2 with double underscore, give folio's _flags_2 and _head_2 a line documentation each] Link: https://lkml.kernel.org/r/9e2cb6b-5b58-d3f2-b5ee-5f8a14e8f10@google.com Link: https://lkml.kernel.org/r/5f52de70-975-e94f-f141-543765736181@google.com Link: https://lkml.kernel.org/r/3818cc9a-9999-d064-d778-9c94c5911e6@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: use pte markers for swap errorsPeter Xu
PTE markers are an ideal mechanism for things like SWP_SWAPIN_ERROR. Using a whole swap entry type for this purpose is overkill, especially when we already have PTE markers. Define a new marker bit for swapin errors and use it to replace SWP_SWAPIN_ERROR. Then we can safely drop SWP_SWAPIN_ERROR and give one device slot back to swap. SWP_SWAPIN_ERROR used to carry the page pfn as part of the swap entry, but the pfn was never used. Nor is it obvious how it could be useful: a swapin failure is normally caused by a bad swap device, not a bad page. Drop it as well. Link: https://lkml.kernel.org/r/20221030214151.402274-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Huang Ying <ying.huang@intel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
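A sketch of the replacement, assuming the new marker bit is defined alongside the existing uffd-wp one; the names here follow the description above, not necessarily the patch verbatim (make_pte_marker() and set_pte_at() are existing kernel helpers).

    /* assumed definition, next to the existing PTE_MARKER_UFFD_WP bit: */
    #define PTE_MARKER_SWAPIN_ERROR  BIT(1)

    /* on swapin failure, install a marker pte instead of a pte built
       from a dedicated SWP_SWAPIN_ERROR swap entry type: */
    static void set_swapin_error_pte(struct vm_area_struct *vma,
                    unsigned long addr, pte_t *ptep)
    {
            pte_t marker = make_pte_marker(PTE_MARKER_SWAPIN_ERROR);

            set_pte_at(vma->vm_mm, addr, ptep, marker);
            /* the fault path turns this marker into SIGBUS; no pfn is
               carried, since the device, not the page, was bad */
    }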
2022-11-30mm: always compile in pte markersPeter Xu
Patch series "mm: Use pte marker for swapin errors". This series uses the pte marker to replace the swapin error swap entry, then we save one more swap entry slot for swap devices. A new pte marker bit is defined. This patch (of 2): The PTE markers code is tiny and now it's enabled for most of the distributions. It's fine to keep it as-is, but to make a broader use of it (e.g. replacing read error swap entry) it needs to be there always otherwise we need special code path to take care of !PTE_MARKER case. It'll be easier just make pte marker always exist. Use this chance to extend its usage to anonymous too by simply touching up some of the old comments, because it'll be used for anonymous pages in the follow up patches. Link: https://lkml.kernel.org/r/20221030214151.402274-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20221030214151.402274-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Huang Ying <ying.huang@intel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/damon/core: add a callback for scheme target regions checkSeongJae Park
Patch series "efficiently expose damos action tried regions information". DAMON users can retrieve the monitoring results via 'after_aggregation' callbacks if the user is using the kernel API, or 'damon_aggregated' tracepoint if the user is in the user space. Those are useful if full monitoring results are necessary. However, if the user has interest in only a snapshot of the results for some regions having specific access pattern, the interfaces could be inefficient. For example, some users only want to know which memory regions are not accessed for more than a specific time at the moment. Also, some DAMOS users would want to know exactly to what memory regions the schemes' actions tried to be applied, for a debugging or a tuning. As DAMOS has its internal mechanism for quota and regions prioritization, the users would need to simulate DAMOS' mechanism against the monitoring results. That's unnecessarily complex. This patchset implements DAMON kernel API callbacks and sysfs directory for efficient exposure of the information for the use cases. The new callback will be called for each region when a DAMOS action is gonna tried to be applied to it. The sysfs directory will be called 'tried_regions' and placed under each scheme sysfs directory. Users can write a special keyworkd, 'update_schemes_regions', to the 'state' file of a kdamond sysfs directory. Then, DAMON sysfs interface will fill the directory with the information of regions that corresponding scheme action was tried to be applied for next one aggregation interval. Patches Sequence ---------------- The first one (patch 1) implements the callback for the kernel space users. Following two patches (patches 2 and 3) implements sysfs directories for the information and its sub directories. Two patches (patches 4 and 5) for implementing the special keywords for filling the data to and cleaning up the directories follow. Patch 6 adds a selftest for the new sysfs directory. Finally, two patches (patches 7 and 8) document the new feature in the administrator guide and the ABI document. This patch (of 8): Getting DAMON monitoring results of only specific access pattern (e.g., getting address ranges of memory that not accessed at all for two minutes) can be useful for efficient monitoring of the system. The information can also be helpful for deep level investigation of DAMON-based operation schemes. For that, users need to record (in case of the user space users) or iterate (in case of the kernel space users) full monitoring results and filter it out for the specific access pattern. In case of the DAMOS investigation, users will even need to simulate DAMOS' quota and prioritization mechanisms. It's inefficient and complex. Add a new DAMON callback that will be called before each scheme is applied to each region. DAMON kernel API users will be able to do the query-like monitoring results collection, or DAMOS investigation in an efficient and simple way using it. Commits for providing the capability to the user space users will follow. Link: https://lkml.kernel.org/r/20221101220328.95765-1-sj@kernel.org Link: https://lkml.kernel.org/r/20221101220328.95765-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>