path: root/include/linux
2023-04-06 io_uring: reduce scheduling due to tw (Pavel Begunkov)
Every task_work item tries to wake the task it is queued on, which causes excessive scheduling and additional overhead. For some task_work that is justified, but other items do little more than post a single CQE. When a task waits for multiple CQEs, every such task_work wakes it up. Instead, let the task give a hint about how many CQEs it is waiting for; io_req_local_work_add() compares against that and skips the wake-up if #cqes + #tw is not enough to satisfy the waiting condition. Task_work that uses the optimisation should be simple enough to never post more than one CQE. The optimisation is ignored for non-DEFER_TASKRUN rings. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d2b77e99d1e86624d8a69f7037d764b739dcd225.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
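A minimal sketch of the skip-wakeup idea described above; the names used here (cq_wait_nr, nr_tw, submitter_task) are assumptions for illustration and not necessarily the exact upstream identifiers:

	/* Waiter publishes how many CQEs it needs before it must run. */
	atomic_set(&ctx->cq_wait_nr, wait_nr);

	/* In io_req_local_work_add(): nr_tw counts queued tw items, each
	 * expected to post at most one CQE. Only wake the task if the
	 * pending completions can actually satisfy its wait condition. */
	if (nr_tw >= atomic_read(&ctx->cq_wait_nr))
		wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);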
2023-04-06 PCI/MSI: Provide missing stub for pci_msix_can_alloc_dyn() (Reinette Chatre)
pci_msix_can_alloc_dyn() is not declared when CONFIG_PCI_MSI is disabled. There is no existing user of pci_msix_can_alloc_dyn(), but work is in progress to change this. This work encounters the following error when CONFIG_PCI_MSI is disabled: drivers/vfio/pci/vfio_pci_intrs.c:427:21: error: implicit declaration of function 'pci_msix_can_alloc_dyn' [-Werror=implicit-function-declaration] Provide a stub definition of pci_msix_can_alloc_dyn() in preparation for users that need to compile when CONFIG_PCI_MSI is disabled. [bhelgaas: Also reported by Arnd Bergmann <arnd@kernel.org> in drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c; added his Fixes: line] Fixes: fb0a6a268dcd ("net/mlx5: Provide external API for allocating vectors") Fixes: 34026364df8e ("PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X") Link: https://lore.kernel.org/oe-kbuild-all/202303291000.PWFqGCxH-lkp@intel.com/ Link: https://lore.kernel.org/r/310ecc4815dae4174031062f525245f0755c70e2.1680119924.git.reinette.chatre@intel.com Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Cc: stable@vger.kernel.org # v6.2+
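The usual shape of such a stub, sketched here; when CONFIG_PCI_MSI is off, the inline version simply reports that dynamic MSI-X allocation is unavailable:

	#ifdef CONFIG_PCI_MSI
	bool pci_msix_can_alloc_dyn(struct pci_dev *dev);
	#else
	static inline bool pci_msix_can_alloc_dyn(struct pci_dev *dev)
	{
		return false;
	}
	#endif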
2023-04-06 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (Jakub Kicinski)
Conflicts:
  drivers/net/ethernet/google/gve/gve.h
    3ce934558097 ("gve: Secure enough bytes in the first TX desc for all TCP pkts")
    75eaae158b1b ("gve: Add XDP DROP and TX support for GQI-QPL format")
  https://lore.kernel.org/all/20230406104927.45d176f5@canb.auug.org.au/
  https://lore.kernel.org/all/c5872985-1a95-0bc8-9dcc-b6f23b439e9d@tessares.net/

Adjacent changes:
  net/can/isotp.c
    051737439eae ("can: isotp: fix race between isotp_sendsmg() and isotp_release()")
    96d1c81e6a04 ("can: isotp: add module parameter for maximum pdu size")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-06 Merge tag 'net-6.3-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (Linus Torvalds)
Pull networking fixes from Jakub Kicinski:
 "Including fixes from wireless and can.

  Current release - regressions:
   - wifi: mac80211:
      - fix potential null pointer dereference
      - fix receiving mesh packets in forwarding=0 networks
      - fix mesh forwarding

  Current release - new code bugs:
   - virtio/vsock: fix leaks due to missing skb owner

  Previous releases - regressions:
   - raw: fix NULL deref in raw_get_next().
   - sctp: check send stream number after wait_for_sndbuf
   - qrtr:
      - fix a refcount bug in qrtr_recvmsg()
      - do not do DEL_SERVER broadcast after DEL_CLIENT
   - wifi: brcmfmac: fix SDIO suspend/resume regression
   - wifi: mt76: fix use-after-free in fw features query.
   - can: fix race between isotp_sendsmg() and isotp_release()
   - eth: mtk_eth_soc: fix remaining throughput regression
   - eth: ice: reset FDIR counter in FDIR init stage

  Previous releases - always broken:
   - core: don't let netpoll invoke NAPI if in xmit context
   - icmp: guard against too small mtu
   - ipv6: fix an uninit variable access bug in __ip6_make_skb()
   - wifi: mac80211: fix the size calculation of ieee80211_ie_len_eht_cap()
   - can: fix poll() to not report false EPOLLOUT events
   - eth: gve: secure enough bytes in the first TX desc for all TCP pkts"

* tag 'net-6.3-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
  net: stmmac: check fwnode for phy device before scanning for phy
  net: stmmac: Add queue reset into stmmac_xdp_open() function
  selftests: net: rps_default_mask.sh: delete veth link specifically
  net: fec: make use of MDIO C45 quirk
  can: isotp: fix race between isotp_sendsmg() and isotp_release()
  can: isotp: isotp_ops: fix poll() to not report false EPOLLOUT events
  can: isotp: isotp_recvmsg(): use sock_recv_cmsgs() to get SOCK_RXQ_OVFL infos
  can: j1939: j1939_tp_tx_dat_new(): fix out-of-bounds memory access
  gve: Secure enough bytes in the first TX desc for all TCP pkts
  netlink: annotate lockless accesses to nlk->max_recvmsg_len
  ethtool: reset #lanes when lanes is omitted
  ping: Fix potentail NULL deref for /proc/net/icmp.
  raw: Fix NULL deref in raw_get_next().
  ice: Reset FDIR counter in FDIR init stage
  ice: fix wrong fallback logic for FDIR
  net: stmmac: fix up RX flow hash indirection table when setting channels
  net: ethernet: ti: am65-cpsw: Fix mdio cleanup in probe
  wifi: mt76: ignore key disable commands
  wifi: ath11k: reduce the MHI timeout to 20s
  ipv6: Fix an uninit variable access bug in __ip6_make_skb()
  ...
2023-04-06 ACPI: Replace irqdomain.h include with struct declarations (Rob Herring)
linux/acpi.h includes irqdomain.h, which includes of.h. Break the include chain by replacing the irqdomain.h include with forward declarations of struct irq_domain and struct irq_domain_ops, which are sufficient for acpi.h. of.h also includes mod_devicetable.h, which many drivers implicitly depend on. As acpi.h already includes it, just move it out of the '#ifdef CONFIG_ACPI'. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
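The change follows the standard forward-declaration pattern; roughly, in include/linux/acpi.h:

	-#include <linux/irqdomain.h>
	+struct irq_domain;
	+struct irq_domain_ops;

Only pointers to these types appear in acpi.h, so the full definitions (and the whole irqdomain.h/of.h include chain) are not needed there.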
2023-04-06 ext4: Convert ext4_bio_write_page() to use a folio (Matthew Wilcox)
Remove several calls to compound_head() and the last caller of set_page_writeback_keepwrite(), so remove the wrapper too. Also export bio_add_folio() as this is the first caller from a module. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-4-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06 fscrypt: Add some folio helper functions (Matthew Wilcox)
fscrypt_is_bounce_folio() is the equivalent of fscrypt_is_bounce_page() and fscrypt_pagecache_folio() is the equivalent of fscrypt_pagecache_page(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230324180129.1220691-3-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
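A plausible shape for these helpers, mirroring their page-based counterparts (a sketch; the real definitions live in fscrypt.h and depend on CONFIG_FS_ENCRYPTION):

	static inline bool fscrypt_is_bounce_folio(struct folio *folio)
	{
		/* bounce folios are never in the page cache */
		return folio->mapping == NULL;
	}

	static inline struct folio *fscrypt_pagecache_folio(struct folio *bounce_folio)
	{
		/* the bounce folio keeps a back-pointer to the pagecache folio */
		return bounce_folio->private;
	}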
2023-04-06 fs: Add FGP_WRITEBEGIN (Matthew Wilcox)
This particular combination of flags is used by most filesystems in their ->write_begin method, although it does find use in a few other places. Before folios, it warranted its own function (grab_cache_page_write_begin()), but I think that just having specialised flags is enough. It certainly helps the few places that have been converted from grab_cache_page_write_begin() to __filemap_get_folio(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-2-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
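A sketch of the combined flag and a typical ->write_begin lookup; the flag set shown is the one grab_cache_page_write_begin() historically passed (treat the exact composition as an assumption):

	#define FGP_WRITEBEGIN	(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STALE)

	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
				    mapping_gfp_mask(mapping));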
2023-04-06 USB: core: Fix docs warning caused by wireless_status feature (Bastien Nocera)
Fix the wrongly named 'dev' parameter in the doc block, which should have been 'iface': drivers/usb/core/message.c:1939: warning: Function parameter or member 'iface' not described in 'usb_set_wireless_status' drivers/usb/core/message.c:1939: warning: Excess function parameter 'dev' description in 'usb_set_wireless_status' Also add the missing struct member doc in the kernel API, and reorder it to match the struct: include/linux/usb.h:270: warning: Function parameter or member 'wireless_status_work' not described in 'usb_interface' Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Link: https://lore.kernel.org/linux-next/20230405114807.5a57bf46@canb.auug.org.au/T/#t Fixes: 0a4db185f078 ("USB: core: Add API to change the wireless_status") Signed-off-by: Bastien Nocera <hadess@hadess.net> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20230405092754.36579-1-hadess@hadess.net [bentiss: fix checkpatch warning] Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
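For reference, kernel-doc requires the @-parameter names to match the function signature exactly; a corrected block looks roughly like this (the description text here is illustrative):

	/**
	 * usb_set_wireless_status - set the wireless_status of an interface
	 * @iface: the interface whose status is being changed
	 * @status: the new wireless status
	 */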
2023-04-06 pwm: stm32: Enforce settings for PWM capture (Olivier Moysan)
The PWM capture assumes that the input selector is set to default input and that the slave mode is disabled. Force reset state for TISEL and SMCR registers to match this requirement. Note that slave mode disabling is not a pre-requisite by itself for capture mode, as hardware supports it for PWM capture. However, the current implementation of the driver does not allow slave mode for PWM capture. Setting slave mode for PWM capture results in wrong capture values. Signed-off-by: Olivier Moysan <olivier.moysan@foss.st.com> Acked-by: Lee Jones <lee@kernel.org> Acked-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Thierry Reding <thierry.reding@gmail.com>
2023-04-06 Merge tag 'drm/tegra/for-6.4-rc1' of https://gitlab.freedesktop.org/drm/tegra into drm-next (Daniel Vetter)
drm/tegra: Changes for v6.4-rc1 The majority of this is minor cleanups and fixes. Other than those, this contains Uwe's conversion to the new driver remove callback and Thomas' fbdev DRM client conversion. The driver can now also be built on other architectures to ease compile coverage. Finally, this adds Mikko as a second maintainer for the driver. As a next step we also want Tegra DRM to move into drm-misc to streamline the maintenance process. Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> From: Thierry Reding <thierry.reding@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230406121404.967704-1-thierry.reding@gmail.com
2023-04-06 platform/x86: apple-gmux: Fix iomem_base __iomem annotation (Hans de Goede)
Fix the __iomem annotation of the iomem_base pointers in the apple-gmux code. The __iomem should go before the *. This fixes a bunch of sparse warnings like this one: drivers/platform/x86/apple-gmux.c:224:48: sparse: expected void const [noderef] __iomem * got unsigned char [usertype] * Fixes: 0c18184de990 ("platform/x86: apple-gmux: support MMIO gmux on T2 Macs") Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202304040401.IMxt7Ubi-lkp@intel.com/ Suggested-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Orlando Chamberlain <orlandoch.dev@gmail.com> Link: https://lore.kernel.org/r/20230404111955.43266-1-hdegoede@redhat.com
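The placement matters because __iomem, like const, qualifies what the pointer points to when it appears before the *:

	u8 __iomem *iomem_base;		/* correct: pointer to __iomem memory */
	u8 * __iomem iomem_base;	/* wrong: the pointer variable itself is __iomem */

sparse only recognises the first form as a pointer into I/O memory, which is what ioremap() returns.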
2023-04-06 netfilter: br_netfilter: fix recent physdev match breakage (Florian Westphal)
A recent attempt to ensure the PREROUTING hook is executed again, when a decrypted ipsec packet received on a bridge passes through the network stack a second time, broke the physdev match in the INPUT hook. We can't discard the nf_bridge info struct in the sabotage_in hook, as it is needed by the physdev match. Keep the struct around and handle this with another conditional instead. Fixes: 2b272bb558f1 ("netfilter: br_netfilter: disable sabotage_in hook after first suppression") Reported-and-tested-by: Farid BENAMROUCHE <fariouche@yahoo.fr> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-05 kallsyms: move module-related functions under correct configs (Viktor Malik)
Functions for searching module kallsyms should have non-empty definitions only if CONFIG_MODULES=y and CONFIG_KALLSYMS=y. Until now, only the CONFIG_MODULES check was used for many of these, which may have caused compilation errors on some configs. This patch moves all relevant functions under the correct configs. Fixes: bd5314f8dd2d ("kallsyms, bpf: Move find_kallsyms_symbol_value out of internal header") Signed-off-by: Viktor Malik <vmalik@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202303181535.RFDCnz3E-lkp@intel.com/ Link: https://lore.kernel.org/r/20230330102001.2183693-1-vmalik@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
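The guard pattern in question, sketched for one of the affected lookups (the stub return value is illustrative):

	#if defined(CONFIG_MODULES) && defined(CONFIG_KALLSYMS)
	unsigned long module_kallsyms_lookup_name(const char *name);
	#else
	static inline unsigned long module_kallsyms_lookup_name(const char *name)
	{
		return 0;	/* no module symbols without both options */
	}
	#endif /* CONFIG_MODULES && CONFIG_KALLSYMS */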
2023-04-05 rpmsg: qcom_smd: Make qcom_smd_unregister_edge() return void (Uwe Kleine-König)
This function returned zero unconditionally. Convert it to return no value instead. This makes it more obvious what happens in the callers. One caller is converted to return zero explicitly. The only other caller (smd_subdev_stop() in drivers/remoteproc/qcom_common.c) already ignored the return value before. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230321154039.355098-2-u.kleine-koenig@pengutronix.de
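The conversion itself is just a signature change:

	-int qcom_smd_unregister_edge(struct qcom_smd_edge *edge);
	+void qcom_smd_unregister_edge(struct qcom_smd_edge *edge);

Callers that previously propagated the always-zero return value now return 0 explicitly, or simply keep ignoring it, as smd_subdev_stop() already did.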
2023-04-05 sched/numa: use hash_32 to mix up PIDs accessing VMA (Raghavendra K T)
Before: the last 6 bits of the PID were used as an index to store information about tasks accessing VMAs. After: hash_32() is used to take care of cases where tasks are created over a period of time, reducing the probability of collisions. Result: the patch series improves overall autonuma cost. Kernbench shows more than 5% improvement, and system time in mmtests autonuma showed more than 80% improvement. Link: https://lkml.kernel.org/r/d5a9f75513300caed74e5c8570bba9317b963c2b.1677672277.git.raghavendra.kt@amd.com Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Disha Talreja <dishaa.talreja@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
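A sketch of the marking step; the numab_state/access_pids field names are assumptions carried over from the earlier patches in this series:

	unsigned int pid_bit;

	/* Mix the PID so tasks created back-to-back don't land in
	 * consecutive bits; hash_32() spreads them over the word. */
	pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));
	if (!test_bit(pid_bit, &vma->numab_state->access_pids[1]))
		__set_bit(pid_bit, &vma->numab_state->access_pids[1]);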
2023-04-05 sched/numa: implement access PID reset logic (Raghavendra K T)
This helps ensure that only recently active PIDs scan the VMAs. Current implementation (idea supported by PeterZ): 1. Access PID information is maintained in two windows, with access_pids[1] being the newest. 2. Old access PID info, i.e. access_pids[0], is reset every (4 * sysctl_numa_balancing_scan_delay) interval after the initial scan delay period expires. The above interval seemed experimentally optimum, since it avoids frequent resets of access info as well as helps clear old access info regularly. The reset logic is implemented in the scan path. Link: https://lkml.kernel.org/r/f7a675f66d1442d048b4216b2baf94515012c405.1677672277.git.raghavendra.kt@amd.com Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com> Suggested-by: Mel Gorman <mgorman@techsingularity.net> Cc: Bharata B Rao <bharata@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Disha Talreja <dishaa.talreja@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 sched/numa: enhance vma scanning logic (Raghavendra K T)
During NUMA scanning, make sure only relevant VMAs of the tasks are scanned. Before: all the tasks of a process participate in scanning a VMA even if they never access it in its lifespan. Now: except for the first few unconditional scans, if a process does not touch a VMA (excluding false positives from PID collisions), its tasks no longer scan that VMA. Logic used: 1) 6 bits of the PID are used to mark an active bit in the VMA numab status during a fault, to remember PIDs accessing the VMA (thanks Mel). 2) Subsequently in the scan path, VMA scanning is skipped if the current PID has not accessed the VMA. 3) The first two scans are still allowed unconditionally, to preserve the earlier scanning behaviour. Acknowledgement to Bharata B Rao <bharata@amd.com> for the initial patch to store PID information and Peter Zijlstra <peterz@infradead.org> (usage of test and set bit). Link: https://lkml.kernel.org/r/092f03105c7c1d3450f4636b1ea350407f07640e.1677672277.git.raghavendra.kt@amd.com Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com> Suggested-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Hildenbrand <david@redhat.com> Cc: Disha Talreja <dishaa.talreja@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 sched/numa: apply the scan delay to every new vma (Mel Gorman)
Patch series "sched/numa: Enhance vma scanning", v3. The patchset proposes one of the enhancements to numa vma scanning suggested by Mel. This is a continuation of [3], reposting the rebased patchset to the akpm mm-unstable tree (March 1). The existing scan period mechanism derives the scan period from per-thread stats. Process Adaptive autoNUMA [1] proposed gathering NUMA fault stats at per-process level to capture application behaviour better. During that course of discussion, Mel proposed several ideas to enhance current numa balancing. One of the suggestions was:

  Track what threads access a VMA. The suggestion was to use an unsigned
  long pid_mask and use the lower bits to tag approximately what threads
  access a VMA. Skip VMAs that did not trap a fault. This would be
  approximate because of PID collisions but would reduce scanning of
  areas the thread is not interested in.

The above suggestion intends not to penalize threads that have no interest in the vma, thus reducing scanning overhead. V3 changes are mostly based on PeterZ comments (details below in changes).

Summary of patchset; the current patchset implements:
1. Delay the vma scanning logic for newly created VMAs so that the additional overhead of scanning is not incurred for short-lived tasks (implementation by Mel); see the sketch after this entry.
2. Store the information of tasks accessing a VMA in 2 windows, cleared regularly at a (4*sysctl_numa_balancing_scan_delay) interval. The above time was derived from experimenting (suggested by PeterZ) to balance between frequent clearing vs obsolete access data.
3. hash_32 is used to encode the task index accessing VMA information.
4. The VMA's access information is used to skip scanning for tasks which have not accessed the VMA.

Changes since V2:
Patch1: - Renaming of structure, macro to function - Add explanation to heuristics - Adding more details from result (PeterZ)
Patch2: - Usage of test and set bit (PeterZ) - Move storing access PID info to numa_migrate_prep() - Add a note on fairness among tasks allowed to scan (PeterZ)
Patch3: - Maintain two windows of access PID information (PeterZ supported implementation and gave the idea to extend to N if needed)
Patch4: - Apply hash_32 function to track VMA accessing PIDs (PeterZ)

Changes since RFC V1: - Include Mel's vma scan delay patch - Change the accessing pid store logic (thanks Mel) - Fencing structure / code to NUMA_BALANCING (David, Mel) - Adding clearing access PID logic (Mel) - Descriptive change log (Mike Rapoport)

Things to ponder over:
- Improvement to the clearing accessing-PIDs logic (discussed in detail in patch3 itself; done in this patchset by implementing a 2-window history)
- The current scan period is not changed in the patchset, so we do see frequent tries to scan. Relaxing the scan period dynamically could improve results further.

[1] sched/numa: Process Adaptive autoNUMA Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/
[2] RFC V1 Link: https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/
[3] V2 Link: https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@amd.com/

Results summary: huge autonuma cost reduction seen in mmtests. Kernbench improvement is more than 5%, and there is a huge system time (80%+) improvement from mmtests autonuma. (dbench had too large a std deviation to post.)

kernbench
                           6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Amean     user-256         22002.51 (   0.00%)     22649.95 *  -2.94%*
Amean     syst-256         10162.78 (   0.00%)      8214.13 *  19.17%*
Amean     elsp-256           160.74 (   0.00%)       156.92 *   2.38%*

Duration User              66017.43                 67959.84
Duration System            30503.15                 24657.03
Duration Elapsed             504.61                   493.12

                                    6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Ops NUMA alloc hit                       1738835089.00           1738780310.00
Ops NUMA alloc local                     1738834448.00           1738779711.00
Ops NUMA base-page range updates             477310.00               392566.00
Ops NUMA PTE updates                         477310.00               392566.00
Ops NUMA hint faults                          96817.00                87555.00
Ops NUMA hint local faults %                  10150.00                 2192.00
Ops NUMA hint local percent                      10.48                    2.50
Ops NUMA pages migrated                       86660.00                85363.00
Ops AutoNUMA cost                               489.07                  442.14

autonumabench
                                   6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Amean     syst-NUMA01                  399.50 (   0.00%)        52.05 *  86.97%*
Amean     syst-NUMA01_THREADLOCAL        0.21 (   0.00%)         0.22 *  -5.41%*
Amean     syst-NUMA02                    0.80 (   0.00%)         0.78 *   2.68%*
Amean     syst-NUMA02_SMT                0.65 (   0.00%)         0.68 *  -3.95%*
Amean     elsp-NUMA01                  313.26 (   0.00%)       313.11 *   0.05%*
Amean     elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)         1.08 *  -1.76%*
Amean     elsp-NUMA02                    3.19 (   0.00%)         3.24 *  -1.52%*
Amean     elsp-NUMA02_SMT                3.72 (   0.00%)         3.61 *   2.92%*

Duration User             396433.47   324835.96
Duration System             2808.70      376.66
Duration Elapsed            2258.61     2258.12

                                    6.2.0-mmunstable-base  6.2.0-mmunstable-patched
Ops NUMA alloc hit                         59921806.00             49623489.00
Ops NUMA alloc miss                               0.00                    0.00
Ops NUMA interleave hit                           0.00                    0.00
Ops NUMA alloc local                       59920880.00             49622594.00
Ops NUMA base-page range updates          152259275.00                50075.00
Ops NUMA PTE updates                      152259275.00                50075.00
Ops NUMA PMD updates                              0.00                    0.00
Ops NUMA hint faults                      154660352.00                39014.00
Ops NUMA hint local faults %              138550501.00                23139.00
Ops NUMA hint local percent                       89.58                   59.31
Ops NUMA pages migrated                     8179067.00                14147.00
Ops AutoNUMA cost                            774522.98                  195.69

This patch (of 4): Currently, whenever a new task is created, we wait for sysctl_numa_balancing_scan_delay to avoid unnecessary scanning overhead. Extend the same logic to new or very short-lived VMAs. [raghavendra.kt@amd.com: add initialization in vm_area_dup()] Link: https://lkml.kernel.org/r/cover.1677672277.git.raghavendra.kt@amd.com Link: https://lkml.kernel.org/r/7a6fbba87c8b51e67efd3e74285bb4cb311a16ca.1677672277.git.raghavendra.kt@amd.com Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com> Cc: Bharata B Rao <bharata@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Disha Talreja <dishaa.talreja@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
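A sketch of the delay check this patch adds to the scan path; the numab_state/next_scan field names follow this series, and the exact placement of the allocation (here, lazily on first scan) is an assumption:

	/* Lazily initialise per-VMA numa balancing state on first scan. */
	if (!vma->numab_state) {
		vma->numab_state = kzalloc(sizeof(*vma->numab_state), GFP_KERNEL);
		if (!vma->numab_state)
			continue;
		vma->numab_state->next_scan = jiffies +
			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
	}

	/* New or very short-lived VMA: skip until the delay has elapsed. */
	if (time_before(jiffies, vma->numab_state->next_scan))
		continue;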
2023-04-05 mm: separate vma->lock from vm_area_struct (Suren Baghdasaryan)
vma->lock being part of the vm_area_struct causes a performance regression during page faults: under contention its count and owner fields are constantly updated, and having other parts of vm_area_struct used during page fault handling next to them causes constant cache line bouncing. Fix that by moving the lock outside of the vm_area_struct. All attempts to keep vma->lock inside vm_area_struct in a separate cache line still produce a performance regression, especially on NUMA machines. The smallest regression was achieved when the lock was placed in the fourth cache line, but that bloats vm_area_struct to 256 bytes. Considering performance and memory impact, a separate lock looks like the best option. It increases the memory footprint of each VMA, but that can be optimized later if the new size causes issues. Note that after this change vma_init() does not allocate or initialize vma->lock anymore. A number of drivers allocate a pseudo VMA on the stack, but they never use the VMA's lock, therefore it does not need to be allocated. Future drivers which might need the VMA lock should use vm_area_alloc()/vm_area_free() to allocate the VMA. Link: https://lkml.kernel.org/r/20230227173632.3292573-34-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
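The resulting layout, roughly; the lock becomes a separately allocated object reached through a pointer:

	struct vma_lock {
		struct rw_semaphore lock;
	};

	struct vm_area_struct {
		/* ... hot fields used during page fault handling ... */
		struct vma_lock *vm_lock;	/* separately allocated */
		/* ... */
	};

Because the heavily written rw_semaphore no longer shares cache lines with the hot VMA fields, readers of those fields stop paying for lock contention.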
2023-04-05 mm/mmap: free vm_area_struct without call_rcu in exit_mmap (Suren Baghdasaryan)
call_rcu() can take a long time when callback offloading is enabled. Its use in vm_area_free() can cause regressions in the exit path when multiple VMAs are being freed. Because exit_mmap() is called only after the last mm user drops its refcount, the page fault handlers can't be racing with it. Any other possible users, like the oom-reaper or process_mrelease, are already synchronized using mmap_lock. Therefore exit_mmap() can free VMAs directly, without the use of call_rcu(). Expose __vm_area_free() and use it from exit_mmap() to avoid possible call_rcu() floods and the performance regressions caused by them. Link: https://lkml.kernel.org/r/20230227173632.3292573-33-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: introduce per-VMA lock statistics (Suren Baghdasaryan)
Add a new CONFIG_PER_VMA_LOCK_STATS config option to dump extra statistics about handling page fault under VMA lock. Link: https://lkml.kernel.org/r/20230227173632.3292573-29-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: add FAULT_FLAG_VMA_LOCK flag (Suren Baghdasaryan)
Add a new flag to distinguish page faults handled under protection of per-vma lock. [surenb@google.com: document FAULT_FLAG_VMA_LOCK flag] Link: https://lkml.kernel.org/r/20230301022720.1380780-2-surenb@google.com Link: https://lore.kernel.org/all/20230301113648.7c279865@canb.auug.org.au/ Link: https://lkml.kernel.org/r/20230227173632.3292573-26-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com> Cc: Dan Carpenter <error27@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: introduce lock_vma_under_rcu to be used from arch-specific code (Suren Baghdasaryan)
Introduce the lock_vma_under_rcu() function to look up and lock a VMA during page fault handling. When the VMA is not found, can't be locked, or changes after being locked, the function returns NULL. The lookup is performed under RCU protection to prevent the found VMA from being destroyed before the VMA lock is acquired. VMA lock statistics are updated according to the results. For now only anonymous VMAs can be searched this way; in other cases the function returns NULL. Link: https://lkml.kernel.org/r/20230227173632.3292573-24-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
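An abridged sketch of the lookup, following the description above (close to, but not necessarily identical with, the upstream helper; the maple tree walk and vma_start_read() are the interfaces this series builds on):

	struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
						  unsigned long address)
	{
		MA_STATE(mas, &mm->mm_mt, address, address);
		struct vm_area_struct *vma;

		rcu_read_lock();
		vma = mas_walk(&mas);
		if (!vma || !vma_is_anonymous(vma) || !vma_start_read(vma))
			goto none;
		/* The VMA may have been detached or moved after we found it. */
		if (vma->detached ||
		    address < vma->vm_start || address >= vma->vm_end) {
			vma_end_read(vma);
			goto none;
		}
		rcu_read_unlock();
		return vma;
	none:
		rcu_read_unlock();
		return NULL;
	}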
2023-04-05 mm: introduce vma detached flag (Suren Baghdasaryan)
The per-vma locking mechanism will search for a VMA under RCU protection and then, after locking it, has to ensure it was not removed from the VMA tree after we found it. To make this check efficient, introduce a vma->detached flag to mark VMAs which were removed from the VMA tree. Link: https://lkml.kernel.org/r/20230227173632.3292573-23-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm/khugepaged: write-lock VMA while collapsing a huge page (Suren Baghdasaryan)
Protect VMA from concurrent page fault handler while collapsing a huge page. Page fault handler needs a stable PMD to use PTL and relies on per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(), set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will not be detected by a page fault handler without proper locking. Before this patch, page tables can be walked under any one of the mmap_lock, the mapping lock, and the anon_vma lock; so when khugepaged unlinks and frees page tables, it must ensure that all of those either are locked or don't exist. This patch adds a fourth lock under which page tables can be traversed, and so khugepaged must also lock out that one. [surenb@google.com: vm_lock/i_mmap_rwsem inversion in retract_page_tables] Link: https://lkml.kernel.org/r/20230303213250.3555716-1-surenb@google.com [surenb@google.com: build fix] Link: https://lkml.kernel.org/r/CAJuCfpFjWhtzRE1X=J+_JjgJzNKhq-=JT8yTBSTHthwp0pqWZw@mail.gmail.com Link: https://lkml.kernel.org/r/20230227173632.3292573-16-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: mark VMA as being written when changing vm_flags (Suren Baghdasaryan)
Updates to vm_flags have to be done with the VMA marked as being written, to prevent concurrent page faults or other modifications. Link: https://lkml.kernel.org/r/20230227173632.3292573-14-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
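The modifier helpers enforce this invariant, along the lines of the following sketch (the ACCESS_PRIVATE/__vm_flags detail comes from the companion vm_flags-modifier series):

	static inline void vm_flags_set(struct vm_area_struct *vma,
					vm_flags_t flags)
	{
		vma_start_write(vma);	/* mark the VMA as being written */
		ACCESS_PRIVATE(vma, __vm_flags) |= flags;
	}

Any code path that mutates vm_flags thus first takes the per-VMA write lock, which makes concurrent lockless page faults back off.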
2023-04-05 mm: add per-VMA lock and helper functions to control it (Suren Baghdasaryan)
Introduce per-VMA locking. The lock implementation relies on per-vma and per-mm sequence counters to note exclusive locking:

- read lock - (implemented by vma_start_read) requires the vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to differ. If they match, then there must be a vma exclusive lock held somewhere.
- read unlock - (implemented by vma_end_read) is a trivial vma->lock unlock.
- write lock - (vma_start_write) requires the mmap_lock to be held exclusively; the current mm counter is assigned to the vma counter. This allows multiple vmas to be locked under a single mmap_lock write lock (e.g. during vma merging). The vma counter is modified under the exclusive vma lock.
- write unlock - (vma_end_write_all) is a batch release of all vma locks held. It doesn't pair with a specific vma_start_write! It is done before the exclusive mmap_lock is released, by incrementing the mm sequence counter (mm_lock_seq).
- write downgrade - if the mmap_lock is downgraded to the read lock, all vma write locks are released as well (effectively the same as write unlock).

Link: https://lkml.kernel.org/r/20230227173632.3292573-13-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
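A sketch of the read-lock fast path described above (close to, but not necessarily identical with, the upstream helper):

	static inline bool vma_start_read(struct vm_area_struct *vma)
	{
		/* Matching sequence counters mean a writer owns this VMA. */
		if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
			return false;

		if (!down_read_trylock(&vma->vm_lock->lock))
			return false;

		/* Recheck: a writer may have locked the VMA meanwhile. */
		if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
			up_read(&vma->vm_lock->lock);
			return false;
		}
		return true;
	}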
2023-04-05 mm: move mmap_lock assert function definitions (Suren Baghdasaryan)
Move mmap_lock assert function definitions up so that they can be used by other mmap_lock routines. Link: https://lkml.kernel.org/r/20230227173632.3292573-12-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: rcu safe VMA freeing (Michel Lespinasse)
This prepares for page fault handling under the VMA lock, looking up VMAs under protection of an rcu read lock, instead of the usual mmap read lock. Link: https://lkml.kernel.org/r/20230227173632.3292573-11-surenb@google.com Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
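The preparation is a union in vm_area_struct so a VMA being freed can carry an rcu_head without growing the struct (shape as in this patch; treat the exact fields as a sketch):

	struct vm_area_struct {
		union {
			struct {
				unsigned long vm_start;
				unsigned long vm_end;
			};
	#ifdef CONFIG_PER_VMA_LOCK
			struct rcu_head vm_rcu;	/* used only while freeing */
	#endif
		};
		/* ... */
	};

vm_rcu overlays fields that readers never trust without locking, so reusing their storage during deferred freeing is safe.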
2023-04-05 mm: vmalloc: convert vread() to vread_iter() (Lorenzo Stoakes)
Having previously laid the foundation for converting vread() to an iterator function, pull the trigger and do so. This patch attempts to provide minimal refactoring and to reflect the existing logic as best we can; for example, we continue to zero portions of memory not read, as before. Overall, there should be no functional difference other than a performance improvement in /proc/kcore access to vmalloc regions. Now that we have eliminated the need for a bounce buffer in read_kcore_iter(), we dispense with it and try to write to user memory optimistically, but with faults disabled, via copy_page_to_iter_nofault(). We already have preemption disabled by holding a spin lock. We continue faulting in until the operation is complete. Additionally, we must account for the fact that at any point a copy may fail (most likely due to a fault not being able to occur); in that case we exit, indicating fewer bytes retrieved than expected. [sfr@canb.auug.org.au: fix sparc64 warning] Link: https://lkml.kernel.org/r/20230320144721.663280c3@canb.auug.org.au [lstoakes@gmail.com: redo Stephen's sparc build fix] Link: https://lkml.kernel.org/r/8506cbc667c39205e65a323f750ff9c11a463798.1679566220.git.lstoakes@gmail.com [akpm@linux-foundation.org: unbreak uio.h includes] Link: https://lkml.kernel.org/r/941f88bc5ab928e6656e1e2593b91bf0f8c81e1b.1679511146.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 iov_iter: add copy_page_to_iter_nofault() (Lorenzo Stoakes)
Provide a means to copy a page to user space from an iterator, aborting if a page fault would occur. This supports compound pages, but may be passed a tail page with an offset extending further into the compound page, so we cannot pass a folio. This allows the function to be called from atomic context and to _try_ to copy to user pages if they are faulted in, aborting if not. The function does not use _copy_to_iter() in order to not specify might_fault(); this is similar to copy_page_from_iter_atomic(). This is being added so that an iteratable form of vread() can be implemented while holding spinlocks. Link: https://lkml.kernel.org/r/19734729defb0f498a76bdec1bef3ac48a3af3e8.1679511146.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
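The helper's contract, illustrated in a caller sketch; a short copy means a fault would have been required, and it is up to the caller whether to bail out or fault the page in and retry:

	size_t copied;

	/* Atomic context: will not sleep or fault; may copy fewer bytes. */
	copied = copy_page_to_iter_nofault(page, offset, len, iter);
	if (copied < len) {
		/* Target page not faulted in; report a short read. */
		return done + copied;
	}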
2023-04-05 mm/page_alloc: make deferred page init free pages in MAX_ORDER blocks (Kirill A. Shutemov)
The normal page init path frees pages during boot in MAX_ORDER chunks, but the deferred page init path does it in pageblock blocks. Change the deferred page init path to work in MAX_ORDER blocks. For cases when MAX_ORDER is larger than a pageblock, set the migrate type to MIGRATE_MOVABLE for all pageblocks covered by the page. Link: https://lkml.kernel.org/r/20230321002415.20843-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries (Lorenzo Stoakes)
This functionality's sole user, the drm ttm module, removed support for it in commit 0d979509539e ("drm/ttm: remove ttm_bo_vm_insert_huge()") as the whole approach is currently unworkable without a PMD/PUD special bit and updates to GUP. Link: https://lkml.kernel.org/r/604c2ad79659d4b8a6e3e1611c6219d5d3233988.1678661628.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: remove unused vmf_insert_mixed_prot() (Lorenzo Stoakes)
Patch series "Remove drm/ttm-specific mm changes". Functionality was added specifically for the DRM TTM driver to support mapping memory for VM_MIXEDMAP VMAs with customised protection flags, however this has now been rolled back as issues were found with this approach. This series removes the mm changes too, retaining some of the useful comments. This patch (of 3): The sole user of vmf_insert_mixed_prot(), the drm ttm module, stopped using this in commit f91142c62161 ("drm/ttm: nuke VM_MIXEDMAP on BO mappings v3") citing use of VM_MIXEDMAP in this case being terribly broken. Remove this now-dead code and references to it, but retain the useful description of the prot != vma->vm_page_prot case, moving it to vmf_insert_pfn_prot() instead. Link: https://lkml.kernel.org/r/cover.1678661628.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/a069644388e6f1593a7020d15840e6fc9f39bcaf.1678661628.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm/memtest: add results of early memtest to /proc/meminfo (Tomas Mudrunka)
Currently the memtest results are only presented in dmesg. When running a large fleet of devices without ECC RAM it's currently not easy to do bulk monitoring for memory corruption. You have to parse dmesg, but that's a ring buffer, so the error might disappear after some time. In general I do not consider dmesg to be a great API to query RAM status. In several companies I've seen such errors remain undetected and cause issues for way too long. So I think it makes sense to provide a monitoring API, so that we can safely detect and act upon them. This adds a /proc/meminfo entry which can be easily used by scripts. Link: https://lkml.kernel.org/r/20230321103430.7130-1-tomas.mudrunka@gmail.com Signed-off-by: Tomas Mudrunka <tomas.mudrunka@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
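The new counter shows up alongside the other meminfo fields; on a healthy machine it reads roughly as follows (field name as added by this patch; a non-zero value means early memtest found and reserved that much bad RAM):

	$ grep EarlyMemtestBad /proc/meminfo
	EarlyMemtestBad:       0 kB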
2023-04-05 mm: move vmalloc_init() declaration to mm/internal.h (Mike Rapoport (IBM))
vmalloc_init() is called only from mm_core_init(), there is no need to declare it in include/linux/vmalloc.h Move vmalloc_init() declaration to mm/internal.h Link: https://lkml.kernel.org/r/20230321170513.2401534-14-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: move kmem_cache_init() declaration to mm/slab.h (Mike Rapoport (IBM))
kmem_cache_init() is called only from mm_core_init(), there is no need to declare it in include/linux/slab.h Move kmem_cache_init() declaration to mm/slab.h Link: https://lkml.kernel.org/r/20230321170513.2401534-13-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: move mem_init_print_info() to mm_init.c (Mike Rapoport (IBM))
mem_init_print_info() is only called from mm_core_init(). Move it close to the caller and make it static. Link: https://lkml.kernel.org/r/20230321170513.2401534-12-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 init,mm: fold late call to page_ext_init() to page_alloc_init_late() (Mike Rapoport (IBM))
When deferred initialization of struct pages is enabled, page_ext_init() must be called after all the deferred initialization is done, but there is no point to keep it a separate call from kernel_init_freeable() right after page_alloc_init_late(). Fold the call to page_ext_init() into page_alloc_init_late() and localize deferred_struct_pages variable. Link: https://lkml.kernel.org/r/20230321170513.2401534-11-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: move init_mem_debugging_and_hardening() to mm/mm_init.c (Mike Rapoport (IBM))
init_mem_debugging_and_hardening() is only called from mm_core_init(). Move it close to the caller, make it static and rename it to mem_debugging_and_hardening_init() for consistency with surrounding convention. Link: https://lkml.kernel.org/r/20230321170513.2401534-10-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: call {ptlock,pgtable}_cache_init() directly from mm_core_init() (Mike Rapoport (IBM))
and drop pgtable_init() as it has no real value and its name is misleading. Link: https://lkml.kernel.org/r/20230321170513.2401534-9-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Sergei Shtylyov <sergei.shtylyov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 init,mm: move mm_init() to mm/mm_init.c and rename it to mm_core_init() (Mike Rapoport (IBM))
Make mm_init() a part of mm/ codebase. mm_core_init() better describes what the function does and does not clash with mm_init() in kernel/fork.c Link: https://lkml.kernel.org/r/20230321170513.2401534-8-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm/page_alloc: rename page_alloc_init() to page_alloc_init_cpuhp() (Mike Rapoport (IBM))
The page_alloc_init() name is really misleading because all this function does is set up CPU hotplug callbacks for the page allocator. Rename it to page_alloc_init_cpuhp() so that the name reflects what the function does. Link: https://lkml.kernel.org/r/20230321170513.2401534-6-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: move most of core MM initialization to mm/mm_init.c (Mike Rapoport (IBM))
The bulk of memory management initialization code is spread all over mm/page_alloc.c and makes navigating through page allocator functionality difficult. Move most of the functions marked __init and __meminit to mm/mm_init.c to make it better localized and allow some more spare room before mm/page_alloc.c reaches 10k lines. No functional changes. Link: https://lkml.kernel.org/r/20230321170513.2401534-4-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: move get_page_from_free_area() to mm/page_alloc.c (Mike Rapoport (IBM))
The get_page_from_free_area() helper is only used in mm/page_alloc.c so move it there to reduce noise in include/linux/mmzone.h Link: https://lkml.kernel.org/r/20230319114214.2133332-1-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: userfaultfd: add UFFDIO_CONTINUE_MODE_WP to install WP PTEs (Axel Rasmussen)
UFFDIO_COPY already has UFFDIO_COPY_MODE_WP, so when installing a new PTE to resolve a missing fault, one can install a write-protected one. This is useful when using UFFDIO_REGISTER_MODE_{MISSING,WP} in combination. This was motivated by testing HugeTLB HGM [1], and in particular its interaction with userfaultfd features. Existing userfaultfd code supports using WP and MINOR modes together (i.e. you can register an area with both enabled), but without this CONTINUE flag the combination is in practice unusable. So, add an analogous UFFDIO_CONTINUE_MODE_WP, which does the same thing as UFFDIO_COPY_MODE_WP, but for *minor* faults. Update the selftest to do some very basic exercising of the new flag. Update Documentation/ to describe how these flags are used (neither the COPY nor the new CONTINUE versions of this mode flag were described there before). [1]: https://patchwork.kernel.org/project/linux-mm/cover/20230218002819.1486479-1-jthoughton@google.com/ Link: https://lkml.kernel.org/r/20230314221250.682452-5-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
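Userspace resolves a minor fault with a write-protected PTE roughly like this (fault_addr and page_size are illustrative variables):

	struct uffdio_continue cont = {
		.range = {
			.start = fault_addr & ~(page_size - 1),
			.len   = page_size,
		},
		.mode = UFFDIO_CONTINUE_MODE_WP,	/* install the PTE write-protected */
	};

	if (ioctl(uffd, UFFDIO_CONTINUE, &cont) == -1)
		err(1, "UFFDIO_CONTINUE");

A later write to the page then raises a uffd-wp fault, exactly as it would after UFFDIO_COPY_MODE_WP.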
2023-04-05 mm: userfaultfd: combine 'mode' and 'wp_copy' arguments (Axel Rasmussen)
Many userfaultfd ioctl functions take both a 'mode' and a 'wp_copy' argument. In future commits we plan to plumb the flags through to more places, so we'd be proliferating the very long argument list even further. Let's take the time to simplify the argument list. Combine the two arguments into one - and generalize, so when we add more flags in the future, it doesn't imply more function arguments. Since the modes (copy, zeropage, continue) are mutually exclusive, store them as an integer value (0, 1, 2) in the low bits. Place combine-able flag bits in the high bits. This is quite similar to an earlier patch proposed by Nadav Amit ("userfaultfd: introduce uffd_flags" [1]). The main difference is that patch only handled flags, whereas this patch *also* combines the "mode" argument into the same type to shorten the argument list. [1]: https://lore.kernel.org/all/20220619233449.181323-2-namit@vmware.com/ Link: https://lkml.kernel.org/r/20230314221250.682452-4-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: James Houghton <jthoughton@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
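A simplified sketch of the combined type; the patch's exact macro names may differ, but the point is mutually exclusive modes in the low bits with OR-able flags above them:

	typedef unsigned int uffd_flags_t;

	enum mfill_atomic_mode {
		MFILL_ATOMIC_COPY,
		MFILL_ATOMIC_ZEROPAGE,
		MFILL_ATOMIC_CONTINUE,
		NR_MFILL_ATOMIC_MODES,
	};

	/* Low bits hold the mode... */
	#define MFILL_ATOMIC_MODE_MASK	((uffd_flags_t)(BIT(2) - 1))
	/* ...flag bits start above the mode field. */
	#define MFILL_ATOMIC_WP		((uffd_flags_t)BIT(2))

	static inline enum mfill_atomic_mode uffd_flags_mode(uffd_flags_t flags)
	{
		return (enum mfill_atomic_mode)(flags & MFILL_ATOMIC_MODE_MASK);
	}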
2023-04-05 mm: userfaultfd: don't pass around both mm and vma (Axel Rasmussen)
Quite a few userfaultfd functions took both mm and vma pointers as arguments. Since the mm is trivially accessible via vma->vm_mm, there's no reason to pass both; it just needlessly extends the already long argument list. Get rid of the mm pointer, where possible, to shorten the argument list. Link: https://lkml.kernel.org/r/20230314221250.682452-3-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 mm: userfaultfd: rename functions for clarity + consistency (Axel Rasmussen)
Patch series "mm: userfaultfd: refactor and add UFFDIO_CONTINUE_MODE_WP", v5.

- Commits 1-3 refactor userfaultfd ioctl code without behavior changes, with the main goal of improving consistency and reducing the number of function args.
- Commit 4 adds UFFDIO_CONTINUE_MODE_WP.

This patch (of 4): The basic problem is, over time we've added new userfaultfd ioctls, and we've refactored the code so functions which used to handle only one case are now re-used to deal with several cases. While this happened, we didn't bother to rename the functions. Similarly, as we added new functions, we cargo-culted pieces of the now-inconsistent naming scheme, so those functions too ended up with names that don't make a lot of sense. A key point here is, "copy" in most userfaultfd code refers specifically to UFFDIO_COPY, where we allocate a new page and copy its contents from userspace. There are many functions with "copy" in the name that don't actually do this (at least in some cases). So, rename things into a consistent scheme. The high level idea is that the call stack for userfaultfd ioctls becomes:

  userfaultfd_ioctl
    -> userfaultfd_(particular ioctl)
      -> mfill_atomic_(particular kind of fill operation)
        -> mfill_atomic		/* loops over pages in range */
          -> mfill_atomic_pte	/* deals with single pages */
            -> mfill_atomic_pte_(particular kind of fill operation)
              -> mfill_atomic_install_pte

There are of course some special cases (shmem, hugetlb), but this is the general structure which all function names now adhere to. Link: https://lkml.kernel.org/r/20230314221250.682452-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20230314221250.682452-2-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>