summaryrefslogtreecommitdiff
path: root/arch/powerpc/platforms/pseries
AgeCommit message (Collapse)Author
2025-01-29Merge tag 'powerpc-6.14-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Madhavan Srinivasan: - Fix to handle PE state in pseries_eeh_get_state() - Handle unset of tce window if it was never set Thanks to Narayana Murty N, Ritesh Harjani (IBM), Shivaprasad G Bhat, and Vaishnavi Bhat. * tag 'powerpc-6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/pseries/iommu: Don't unset window if it was never set powerpc/pseries/eeh: Fix get PE state translation
2025-01-28treewide: const qualify ctl_tables where applicableJoel Granados
Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/inifiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function. Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers. Created this by running an spatch followed by a sed command: Spatch: virtual patch @ depends on !(file in "net") disable optional_qualifier @ identifier table_name != { watchdog_hardlockup_sysctl, iwcm_ctl_table, ucma_ctl_table, memory_allocation_profiling_sysctls, loadpin_sysctl_table }; @@ + const struct ctl_table table_name [] = { ... }; sed: sed --in-place \ -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \ kernel/utsname_sysctl.c Reviewed-by: Song Liu <song@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/ Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Corey Minyard <cminyard@mvista.com> Acked-by: Wei Liu <wei.liu@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-26Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Mainly individually changelogged singleton patches. The patch series in this pull are: - "lib min_heap: Improve min_heap safety, testing, and documentation" from Kuan-Wei Chiu provides various tightenings to the min_heap library code - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some cleanup and Rust preparation in the xarray library code - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes pathnames in some code comments - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the new secs_to_jiffies() in various places where that is appropriate - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen switches two filesystems to the new mount API - "Convert ocfs2 to use folios" from Matthew Wilcox does that - "Remove get_task_comm() and print task comm directly" from Yafang Shao removes now-unneeded calls to get_task_comm() in various places - "squashfs: reduce memory usage and update docs" from Phillip Lougher implements some memory savings in squashfs and performs some maintainability work - "lib: clarify comparison function requirements" from Kuan-Wei Chiu tightens the sort code's behaviour and adds some maintenance work - "nilfs2: protect busy buffer heads from being force-cleared" from Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a corrupted image - "nilfs2: fix kernel-doc comments for function return values" from Ryusuke Konishi fixes some nilfs kerneldoc - "nilfs2: fix issues with rename operations" from Ryusuke Konishi addresses some nilfs BUG_ONs which syzbot was able to trigger - "minmax.h: Cleanups and minor optimisations" from David Laight does some maintenance work on the min/max library code - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work on the xarray library code" * tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits) ocfs2: use str_yes_no() and str_no_yes() helper functions include/linux/lz4.h: add some missing macros Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent Xarray: remove repeat check in xas_squash_marks() Xarray: distinguish large entries correctly in xas_split_alloc() Xarray: move forward index correctly in xas_pause() Xarray: do not return sibling entries from xas_find_marked() ipc/util.c: complete the kernel-doc function descriptions gcov: clang: use correct function param names latencytop: use correct kernel-doc format for func params minmax.h: remove some #defines that are only expanded once minmax.h: simplify the variants of clamp() minmax.h: move all the clamp() definitions after the min/max() ones minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp() minmax.h: reduce the #define expansion of min(), max() and clamp() minmax.h: update some comments minmax.h: add whitespace around operators and after commas nilfs2: do not update mtime of renamed directory that is not moved nilfs2: handle errors that nilfs_prepare_chunk() may return CREDITS: fix spelling mistake ...
2025-01-21powerpc/pseries/iommu: Don't unset window if it was never setShivaprasad G Bhat
On pSeries, when user attempts to use the same vfio container used by different iommu group, the spapr_tce_set_window() returns -EPERM and the subsequent cleanup leads to the below crash. Kernel attempted to read user page (308) - exploit attempt? BUG: Kernel NULL pointer dereference on read at 0x00000308 Faulting instruction address: 0xc0000000001ce358 Oops: Kernel access of bad area, sig: 11 [#1] NIP: c0000000001ce358 LR: c0000000001ce05c CTR: c00000000005add0 <snip> NIP [c0000000001ce358] spapr_tce_unset_window+0x3b8/0x510 LR [c0000000001ce05c] spapr_tce_unset_window+0xbc/0x510 Call Trace: spapr_tce_unset_window+0xbc/0x510 (unreliable) tce_iommu_attach_group+0x24c/0x340 [vfio_iommu_spapr_tce] vfio_container_attach_group+0xec/0x240 [vfio] vfio_group_fops_unl_ioctl+0x548/0xb00 [vfio] sys_ioctl+0x754/0x1580 system_call_exception+0x13c/0x330 system_call_vectored_common+0x15c/0x2ec <snip> --- interrupt: 3000 Fix this by having null check for the tbl passed to the spapr_tce_unset_window(). Fixes: f431a8cde7f1 ("powerpc/iommu: Reimplement the iommu_table_group_ops for pSeries") Cc: stable@vger.kernel.org Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/173674009556.1559.12487885286848752833.stgit@linux.ibm.com
2025-01-21powerpc/pseries/eeh: Fix get PE state translationNarayana Murty N
The PE Reset State "0" returned by RTAS calls "ibm_read_slot_reset_[state|state2]" indicates that the reset is deactivated and the PE is in a state where MMIO and DMA are allowed. However, the current implementation of "pseries_eeh_get_state()" does not reflect this, causing drivers to incorrectly assume that MMIO and DMA operations cannot be resumed. The userspace drivers as a part of EEH recovery using VFIO ioctls fail to detect when the recovery process is complete. The VFIO_EEH_PE_GET_STATE ioctl does not report the expected EEH_PE_STATE_NORMAL state, preventing userspace drivers from functioning properly on pseries systems. The patch addresses this issue by updating 'pseries_eeh_get_state()' to include "EEH_STATE_MMIO_ENABLED" and "EEH_STATE_DMA_ENABLED" in the result mask for PE Reset State "0". This ensures correct state reporting to the callers, aligning the behavior with the PAPR specification and fixing the bug in EEH recovery for VFIO user workflows. Fixes: 00ba05a12b3c ("powerpc/pseries: Cleanup on pseries_eeh_get_state()") Cc: stable@vger.kernel.org Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Narayana Murty N <nnmlinux@linux.ibm.com> Link: https://lore.kernel.org/stable/20241212075044.10563-1-nnmlinux%40linux.ibm.com Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250116103954.17324-1-nnmlinux@linux.ibm.com
2025-01-12powerpc/papr_scm: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @@ constant C; @@ - msecs_to_jiffies(C * 1000) + secs_to_jiffies(C) @@ constant C; @@ - msecs_to_jiffies(C * MSEC_PER_SEC) + secs_to_jiffies(C) Link: https://lkml.kernel.org/r/20241210-converge-secs-to-jiffies-v3-5-ddfefd7e9f2a@linux.microsoft.com Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Daniel Mack <daniel@zonque.org> Cc: David Airlie <airlied@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dick Kennedy <dick.kennedy@broadcom.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Fainelli <florian.fainelli@broadcom.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haojian Zhuang <haojian.zhuang@gmail.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jack Wang <jinpu.wang@cloud.ionos.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: James Smart <james.smart@broadcom.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Jeff Johnson <jjohnson@kernel.org> Cc: Jeff Johnson <quic_jjohnson@quicinc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jeroen de Borst <jeroendb@google.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Julia Lawall <julia.lawall@inria.fr> Cc: Kalle Valo <kvalo@kernel.org> Cc: Louis Peens <louis.peens@corigine.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Naveen N Rao <naveen@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolas Palix <nicolas.palix@imag.fr> Cc: Oded Gabbay <ogabbay@kernel.org> Cc: Ofir Bitton <obitton@habana.ai> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Praveen Kaligineedi <pkaligineedi@google.com> Cc: Ray Jui <rjui@broadcom.com> Cc: Robert Jarzmik <robert.jarzmik@free.fr> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Scott Branden <sbranden@broadcom.com> Cc: Shailend Chand <shailend@google.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: Simon Horman <horms@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Takashi Iwai <tiwai@suse.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-11powerpc/pseries/iommu: IOMMU incorrectly marks MMIO range in DDWGaurav Batra
Power Hypervisor can possibily allocate MMIO window intersecting with Dynamic DMA Window (DDW) range, which is over 32-bit addressing. These MMIO pages needs to be marked as reserved so that IOMMU doesn't map DMA buffers in this range. The current code is not marking these pages correctly which is resulting in LPAR to OOPS while booting. The stack is at below BUG: Unable to handle kernel data access on read at 0xc00800005cd40000 Faulting instruction address: 0xc00000000005cdac Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: af_packet rfkill ibmveth(X) lpfc(+) nvmet_fc nvmet nvme_keyring crct10dif_vpmsum nvme_fc nvme_fabrics nvme_core be2net(+) nvme_auth rtc_generic nfsd auth_rpcgss nfs_acl lockd grace sunrpc fuse configfs ip_tables x_tables xfs libcrc32c dm_service_time ibmvfc(X) scsi_transport_fc vmx_crypto gf128mul crc32c_vpmsum dm_mirror dm_region_hash dm_log dm_multipath dm_mod sd_mod scsi_dh_emc scsi_dh_rdac scsi_dh_alua t10_pi crc64_rocksoft_generic crc64_rocksoft sg crc64 scsi_mod Supported: Yes, External CPU: 8 PID: 241 Comm: kworker/8:1 Kdump: loaded Not tainted 6.4.0-150600.23.14-default #1 SLE15-SP6 b44ee71c81261b9e4bab5e0cde1f2ed891d5359b Hardware name: IBM,9080-M9S POWER9 (raw) 0x4e2103 0xf000005 of:IBM,FW950.B0 (VH950_149) hv:phyp pSeries Workqueue: events work_for_cpu_fn NIP: c00000000005cdac LR: c00000000005e830 CTR: 0000000000000000 REGS: c00001400c9ff770 TRAP: 0300 Not tainted (6.4.0-150600.23.14-default) MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24228448 XER: 00000001 CFAR: c00000000005cdd4 DAR: c00800005cd40000 DSISR: 40000000 IRQMASK: 0 GPR00: c00000000005e830 c00001400c9ffa10 c000000001987d00 c00001400c4fe800 GPR04: 0000080000000000 0000000000000001 0000000004000000 0000000000800000 GPR08: 0000000004000000 0000000000000001 c00800005cd40000 ffffffffffffffff GPR12: 0000000084228882 c00000000a4c4f00 0000000000000010 0000080000000000 GPR16: c00001400c4fe800 0000000004000000 0800000000000000 c00000006088b800 GPR20: c00001401a7be980 c00001400eff3800 c000000002a2da68 000000000000002b GPR24: c0000000026793a8 c000000002679368 000000000000002a c0000000026793c8 GPR28: 000008007effffff 0000080000000000 0000000000800000 c00001400c4fe800 NIP [c00000000005cdac] iommu_table_reserve_pages+0xac/0x100 LR [c00000000005e830] iommu_init_table+0x80/0x1e0 Call Trace: [c00001400c9ffa10] [c00000000005e810] iommu_init_table+0x60/0x1e0 (unreliable) [c00001400c9ffa90] [c00000000010356c] iommu_bypass_supported_pSeriesLP+0x9cc/0xe40 [c00001400c9ffc30] [c00000000005c300] dma_iommu_dma_supported+0xf0/0x230 [c00001400c9ffcb0] [c00000000024b0c4] dma_supported+0x44/0x90 [c00001400c9ffcd0] [c00000000024b14c] dma_set_mask+0x3c/0x80 [c00001400c9ffd00] [c0080000555b715c] be_probe+0xc4/0xb90 [be2net] [c00001400c9ffdc0] [c000000000986f3c] local_pci_probe+0x6c/0x110 [c00001400c9ffe40] [c000000000188f28] work_for_cpu_fn+0x38/0x60 [c00001400c9ffe70] [c00000000018e454] process_one_work+0x314/0x620 [c00001400c9fff10] [c00000000018f280] worker_thread+0x2b0/0x620 [c00001400c9fff90] [c00000000019bb18] kthread+0x148/0x150 [c00001400c9fffe0] [c00000000000ded8] start_kernel_thread+0x14/0x18 There are 2 issues in the code 1. The index is "int" while the address is "unsigned long". This results in negative value when setting the bitmap. 2. The DMA offset is page shifted but the MMIO range is used as-is (64-bit address). MMIO address needs to be page shifted as well. Fixes: 3c33066a2190 ("powerpc/kernel/iommu: Add new iommu_table_in_use() helper") Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20241206210039.93172-1-gbatra@linux.ibm.com
2024-11-27powerpc/machdep: Remove duplicated include in svm.cYang Li
The header files linux/mem_encrypt.h is included twice in svm.c, so one inclusion of each can be removed. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=11750 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20241107010259.46308-1-yang.lee@linux.alibaba.com
2024-11-23Merge tag 'powerpc-6.13-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Rework kfence support for the HPT MMU to work on systems with >= 16TB of RAM. - Remove the powerpc "maple" platform, used by the "Yellow Dog Powerstation". - Add support for DYNAMIC_FTRACE_WITH_CALL_OPS, DYNAMIC_FTRACE_WITH_DIRECT_CALLS & BPF Trampolines. - Add support for running KVM nested guests on Power11. - Other small features, cleanups and fixes. Thanks to Amit Machhiwal, Arnd Bergmann, Christophe Leroy, Costa Shulyupin, David Hunter, David Wang, Disha Goel, Gautam Menghani, Geert Uytterhoeven, Hari Bathini, Julia Lawall, Kajol Jain, Keith Packard, Lukas Bulwahn, Madhavan Srinivasan, Markus Elfring, Michal Suchanek, Ming Lei, Mukesh Kumar Chaurasiya, Nathan Chancellor, Naveen N Rao, Nicholas Piggin, Nysal Jan K.A, Paulo Miguel Almeida, Pavithra Prakash, Ritesh Harjani (IBM), Rob Herring (Arm), Sachin P Bappalige, Shen Lichuan, Simon Horman, Sourabh Jain, Thomas Weißschuh, Thorsten Blum, Thorsten Leemhuis, Venkat Rao Bagalkote, Zhang Zekun, and zhang jiao. * tag 'powerpc-6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (89 commits) EDAC/powerpc: Remove PPC_MAPLE drivers powerpc/perf: Add per-task/process monitoring to vpa_pmu driver powerpc/kvm: Add vpa latency counters to kvm_vcpu_arch docs: ABI: sysfs-bus-event_source-devices-vpa-pmu: Document sysfs event format entries for vpa_pmu powerpc/perf: Add perf interface to expose vpa counters MAINTAINERS: powerpc: Mark Maddy as "M" powerpc/Makefile: Allow overriding CPP powerpc-km82xx.c: replace of_node_put() with __free ps3: Correct some typos in comments powerpc/kexec: Fix return of uninitialized variable macintosh: Use common error handling code in via_pmu_led_init() powerpc/powermac: Use of_property_match_string() in pmac_has_backlight_type() powerpc: remove dead config options for MPC85xx platform support powerpc/xive: Use cpumask_intersects() selftests/powerpc: Remove the path after initialization. powerpc/xmon: symbol lookup length fixed powerpc/ep8248e: Use %pa to format resource_size_t powerpc/ps3: Reorganize kerneldoc parameter names KVM: PPC: Book3S HV: Fix kmv -> kvm typo powerpc/sstep: make emulate_vsx_load and emulate_vsx_store static ...
2024-11-23Merge tag 'mm-stable-2024-11-18-19-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "zram: optimal post-processing target selection" from Sergey Senozhatsky improves zram's post-processing selection algorithm. This leads to improved memory savings. - Wei Yang has gone to town on the mapletree code, contributing several series which clean up the implementation: - "refine mas_mab_cp()" - "Reduce the space to be cleared for maple_big_node" - "maple_tree: simplify mas_push_node()" - "Following cleanup after introduce mas_wr_store_type()" - "refine storing null" - The series "selftests/mm: hugetlb_fault_after_madv improvements" from David Hildenbrand fixes this selftest for s390. - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng implements some rationaizations and cleanups in the page mapping code. - The series "mm: optimize shadow entries removal" from Shakeel Butt optimizes the file truncation code by speeding up the handling of shadow entries. - The series "Remove PageKsm()" from Matthew Wilcox completes the migration of this flag over to being a folio-based flag. - The series "Unify hugetlb into arch_get_unmapped_area functions" from Oscar Salvador implements a bunch of consolidations and cleanups in the hugetlb code. - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain takes away the wp-fault time practice of turning a huge zero page into small pages. Instead we replace the whole thing with a THP. More consistent cleaner and potentiall saves a large number of pagefaults. - The series "percpu: Add a test case and fix for clang" from Andy Shevchenko enhances and fixes the kernel's built in percpu test code. - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett optimizes mremap() by avoiding doing things which we didn't need to do. - The series "Improve the tmpfs large folio read performance" from Baolin Wang teaches tmpfs to copy data into userspace at the folio size rather than as individual pages. A 20% speedup was observed. - The series "mm/damon/vaddr: Fix issue in damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting. - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt removes the long-deprecated memcgv2 charge moving feature. - The series "fix error handling in mmap_region() and refactor" from Lorenzo Stoakes cleanup up some of the mmap() error handling and addresses some potential performance issues. - The series "x86/module: use large ROX pages for text allocations" from Mike Rapoport teaches x86 to use large pages for read-only-execute module text. - The series "page allocation tag compression" from Suren Baghdasaryan is followon maintenance work for the new page allocation profiling feature. - The series "page->index removals in mm" from Matthew Wilcox remove most references to page->index in mm/. A slow march towards shrinking struct page. - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs interface tests" from Andrew Paniakin performs maintenance work for DAMON's self testing code. - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar improves zswap's batching of compression and decompression. It is a step along the way towards using Intel IAA hardware acceleration for this zswap operation. - The series "kasan: migrate the last module test to kunit" from Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests over to the KUnit framework. - The series "implement lightweight guard pages" from Lorenzo Stoakes permits userapace to place fault-generating guard pages within a single VMA, rather than requiring that multiple VMAs be created for this. Improved efficiencies for userspace memory allocators are expected. - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses tracepoints to provide increased visibility into memcg stats flushing activity. - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky fixes a zram buglet which potentially affected performance. - The series "mm: add more kernel parameters to control mTHP" from Maíra Canal enhances our ability to control/configuremultisize THP from the kernel boot command line. - The series "kasan: few improvements on kunit tests" from Sabyrzhan Tasbolatov has a couple of fixups for the KASAN KUnit tests. - The series "mm/list_lru: Split list_lru lock into per-cgroup scope" from Kairui Song optimizes list_lru memory utilization when lockdep is enabled. * tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits) cma: enforce non-zero pageblock_order during cma_init_reserved_mem() mm/kfence: add a new kunit test test_use_after_free_read_nofault() zram: fix NULL pointer in comp_algorithm_show() memcg/hugetlb: add hugeTLB counters to memcg vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount zram: ZRAM_DEF_COMP should depend on ZRAM MAINTAINERS/MEMORY MANAGEMENT: add document files for mm Docs/mm/damon: recommend academic papers to read and/or cite mm: define general function pXd_init() kmemleak: iommu/iova: fix transient kmemleak false positive mm/list_lru: simplify the list_lru walk callback function mm/list_lru: split the lock to per-cgroup scope mm/list_lru: simplify reparenting and initial allocation mm/list_lru: code clean up for reparenting mm/list_lru: don't export list_lru_add mm/list_lru: don't pass unnecessary key parameters kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols ...
2024-11-21Merge tag 'dma-mapping-6.13-2024-11-19' of ↵Linus Torvalds
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - improve the DMA API tracing code (Sean Anderson) - misc cleanups (Christoph Hellwig, Sui Jingfeng) - fix pointer abuse when finding the shared DMA pool (Geert Uytterhoeven) - fix a deadlock in dma-debug (Levi Yun) * tag 'dma-mapping-6.13-2024-11-19' of git://git.infradead.org/users/hch/dma-mapping: dma-mapping: save base/size instead of pointer to shared DMA pool dma-mapping: fix swapped dir/flags arguments to trace_dma_alloc_sgt_err dma-mapping: drop unneeded includes from dma-mapping.h dma-mapping: trace more error paths dma-mapping: use trace_dma_alloc for dma_alloc* instead of using trace_dma_map dma-mapping: trace dma_alloc/free direction dma-mapping: use macros to define events in a class dma-mapping: remove an outdated comment from dma-map-ops.h dma-debug: remove DMA_API_DEBUG_SG dma-debug: store a phys_addr_t in struct dma_debug_entry dma-debug: fix a possible deadlock on radix_lock
2024-11-20Merge tag 'devicetree-for-6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull devicetree updates from Rob Herring: "Bindings: - Enable dtc "interrupt_provider" warnings for binding examples. Fix the warnings in fsl,mu-msi and ti,sci-inta due to this. - Convert zii,rave-sp-wdt, zii,rave-sp-pwrbutton, and altr,fpga-passive-serial to DT schema format - Add some documentation on the different forms of YAML text blocks which are a constant source of review comments - Fix some schema errors in constraints for arrays - Add compatibles for qcom,sar2130p-pdc and onnn,adt7462 DT core: - Allow overlay kunit tests to run CONFIG_OF_OVERLAY=n - Add some warnings on deprecated address handling - Rework early_init_dt_scan() so the arch can pass in the phys address of the DTB as __pa() is not always valid to use. This fixes a warning for arm64 with kexec. - Add and use some new DT graph iterators for iterating over ports and endpoints - Rework reserved-memory handling to be sized dynamically for fixed regions - Optimize of_modalias() to avoid a strlen() call - Constify struct device_node and property pointers where ever possible" * tag 'devicetree-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (36 commits) of: Allow overlay kunit tests to run CONFIG_OF_OVERLAY=n dt-bindings: interrupt-controller: qcom,pdc: Add SAR2130P compatible of/address: Rework bus matching to avoid warnings of: WARN on deprecated #address-cells/#size-cells handling of/fdt: Don't use default address cell sizes for address translation dt-bindings: Enable dtc "interrupt_provider" warnings of/fdt: add dt_phys arg to early_init_dt_scan and early_init_dt_verify dt-bindings: cache: qcom,llcc: Fix X1E80100 reg entries dt-bindings: watchdog: convert zii,rave-sp-wdt.txt to yaml format dt-bindings: input: convert zii,rave-sp-pwrbutton.txt to yaml media: xilinx-tpg: use new of_graph functions fbdev: omapfb: use new of_graph functions gpu: drm: omapdrm: use new of_graph functions ASoC: audio-graph-card2: use new of_graph functions ASoC: audio-graph-card: use new of_graph functions ASoC: test-component: use new of_graph functions of: property: use new of_graph functions of: property: add of_graph_get_next_port_endpoint() of: property: add of_graph_get_next_port() of: module: remove strlen() call in of_modalias() ...
2024-11-19powerpc/perf: Add perf interface to expose vpa countersKajol Jain
To support performance measurement for KVM on PowerVM(KoP) feature, PowerVM hypervisor has added couple of new software counters in Virtual Process Area(VPA) of the partition. Commit e1f288d2f9c69 ("KVM: PPC: Book3S HV nestedv2: Add support for reading VPA counters for pseries guests") have updated the paca fields with corresponding changes. Proposed perf interface is to expose these new software counters for monitoring of context switch latencies and runtime aggregate. Perf interface driver is called "vpa_pmu" and it has dependency on KVM and perf, hence added new config called "VPA_PMU" which depends on "CONFIG_KVM_BOOK3S_64_HV" and "CONFIG_HV_PERF_CTRS". Since, kvm and kvm_host are currently compiled as built-in modules, this perf interface takes the same path and registered as a module. vpa_pmu perf interface needs access to some of the kvm functions and structures like kvmhv_get_l2_counters_status(), hence kvm_book3s_64.h and kvm_ppc.h are included. Below are the events added to monitor KoP: vpa_pmu/l1_to_l2_lat/ vpa_pmu/l2_to_l1_lat/ vpa_pmu/l2_runtime_agg/ and vpa_pmu driver supports only per-cpu monitoring with this patch. Example usage: [command]# perf stat -e vpa_pmu/l1_to_l2_lat/ -a -I 1000 1.001017682 727,200 vpa_pmu/l1_to_l2_lat/ 2.003540491 1,118,824 vpa_pmu/l1_to_l2_lat/ 3.005699458 1,919,726 vpa_pmu/l1_to_l2_lat/ 4.007827011 2,364,630 vpa_pmu/l1_to_l2_lat/ Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Co-developed-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241118114114.208964-1-kjain@linux.ibm.com
2024-11-07asm-generic: introduce text-patching.hMike Rapoport (Microsoft)
Several architectures support text patching, but they name the header files that declare patching functions differently. Make all such headers consistently named text-patching.h and add an empty header in asm-generic for architectures that do not support text patching. Link: https://lkml.kernel.org/r/20241023162711.2579610-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-02powerpc: Split systemcfg struct definitions out from vdsoThomas Weißschuh
The systemcfg data has nothing to do anymore with the vdso. Split it into a dedicated header file. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241010-vdso-generic-base-v1-27-b64f0842d512@linutronix.de
2024-11-02powerpc: Split systemcfg data out of vdso data pageThomas Weißschuh
The systemcfg data only has minimal overlap with the vdso data. Splitting the two avoids mapping the implementation-defined vdso data into /proc/ppc64/systemcfg. It is also a preparation for the standardization of vdso data storage. The only field actually used by both systemcfg and vdso is tb_ticks_per_sec and it is only changed once during time_init(). Initialize it in both structures there. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241010-vdso-generic-base-v1-26-b64f0842d512@linutronix.de
2024-11-02powerpc/pseries/lparcfg: Use num_possible_cpus() for potential processorsThomas Weißschuh
The systemcfg processorCount variable tracks currently online variables, not possible ones, so the stored value is wrong. The code preferably tries to use the ibm,lrdr-capacity field 4 which "represents the maximum number of processors that the guest can have." Switch from processorCount to the better matching num_possible_cpus(). Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241010-vdso-generic-base-v1-24-b64f0842d512@linutronix.de
2024-11-02powerpc/pseries/lparcfg: Fix printing of system_active_processorsThomas Weißschuh
When printing the information "system_active_processors", the variable partition_potential_processors is used instead of partition_active_processors. The wrong value is displayed. Use partition_active_processors instead. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241010-vdso-generic-base-v1-23-b64f0842d512@linutronix.de
2024-10-29of/fdt: add dt_phys arg to early_init_dt_scan and early_init_dt_verifyUsama Arif
__pa() is only intended to be used for linear map addresses and using it for initial_boot_params which is in fixmap for arm64 will give an incorrect value. Hence save the physical address when it is known at boot time when calling early_init_dt_scan for arm64 and use it at kexec time instead of converting the virtual address using __pa(). Note that arm64 doesn't need the FDT region reserved in the DT as the kernel explicitly reserves the passed in FDT. Therefore, only a debug warning is fixed with this change. Reported-by: Breno Leitao <leitao@debian.org> Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Usama Arif <usamaarif642@gmail.com> Fixes: ac10be5cdbfa ("arm64: Use common of_kexec_alloc_and_setup_fdt()") Link: https://lore.kernel.org/r/20241023171426.452688-1-usamaarif642@gmail.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2024-10-29powerpc/pseries: Fix dtl_access_lock to be a rw_semaphoreMichael Ellerman
The dtl_access_lock needs to be a rw_sempahore, a sleeping lock, because the code calls kmalloc() while holding it, which can sleep: # echo 1 > /proc/powerpc/vcpudispatch_stats BUG: sleeping function called from invalid context at include/linux/sched/mm.h:337 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 199, name: sh preempt_count: 1, expected: 0 3 locks held by sh/199: #0: c00000000a0743f8 (sb_writers#3){.+.+}-{0:0}, at: vfs_write+0x324/0x438 #1: c0000000028c7058 (dtl_enable_mutex){+.+.}-{3:3}, at: vcpudispatch_stats_write+0xd4/0x5f4 #2: c0000000028c70b8 (dtl_access_lock){+.+.}-{2:2}, at: vcpudispatch_stats_write+0x220/0x5f4 CPU: 0 PID: 199 Comm: sh Not tainted 6.10.0-rc4 #152 Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries Call Trace: dump_stack_lvl+0x130/0x148 (unreliable) __might_resched+0x174/0x410 kmem_cache_alloc_noprof+0x340/0x3d0 alloc_dtl_buffers+0x124/0x1ac vcpudispatch_stats_write+0x2a8/0x5f4 proc_reg_write+0xf4/0x150 vfs_write+0xfc/0x438 ksys_write+0x88/0x148 system_call_exception+0x1c4/0x5a0 system_call_common+0xf4/0x258 Fixes: 06220d78f24a ("powerpc/pseries: Introduce rwlock to gatekeep DTLB usage") Tested-by: Kajol Jain <kjain@linux.ibm.com> Reviewed-by: Nysal Jan K.A <nysal@linux.ibm.com> Reviewed-by: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20240819122401.513203-1-mpe@ellerman.id.au
2024-10-29powerpc/machdep: Drop include of dma-mapping.hMichael Ellerman
Drop the include of dma-mapping.h in machdep.h, replace it with forward declarations of struct device and struct pci_dev, and include time64.h and page.h which are required for time64_t and pgprot_t respectively. Add direct includes of some other headers to some files that were getting them via machdep.h. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241009051826.132805-2-mpe@ellerman.id.au
2024-10-29powerpc/machdep: Drop include of seq_file.hMichael Ellerman
Drop the include of seq_file.h in machdep.h, replace it with a forward declaration of struct seq_file, which is all that's required. Add direct includes of seq_file.h to some files that were getting seq_file.h via machdep.h. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241009051826.132805-1-mpe@ellerman.id.au
2024-10-29dma-mapping: drop unneeded includes from dma-mapping.hChristoph Hellwig
Back in the day a lot of logic was implemented inline in dma-mapping.h and needed various includes. Move of this has long been moved out of line, so we can drop various includes to improve kernel rebuild times. Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-10-02move asm/unaligned.h to linux/unaligned.hAl Viro
asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header. auto-generated by the following: for i in `git grep -l -w asm/unaligned.h`; do sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i done for i in `git grep -l -w asm-generic/unaligned.h`; do sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i done git mv include/asm-generic/unaligned.h include/linux/unaligned.h git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-09-21Merge tag 'mm-stable-2024-09-20-02-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Along with the usual shower of singleton patches, notable patch series in this pull request are: - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification. - "Some cleanups for shmem" from Baolin Wang. No functional changes - mode code reuse, better function naming, logic simplifications. - "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only. - "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup. - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage. - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new feilds to /proc/vmstat such as $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but partivularly useful for "the dynamic kernel stack project". - "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory. - "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters". - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident. - "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded. - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector. - "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalizaion and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness. - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance. - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo. - "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page(). - "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown. - "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature, - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have notied yet. - "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code. - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code. - "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated. - "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation. - "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code. - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes. - "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into simgle-page folios when swapping out shmem. - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios. - "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios. - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal(). - "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type! - "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed. - "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store. - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operqtion is occurring during an unrelated vma tree walk. - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making ot cleaner, more testable and better tested. - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests. - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions. - "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats. - "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning. - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs. - "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization. - "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates. - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags. - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy. This was overprovisioning THPs in sparsely accessed memory areas. - "zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning. - "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas. - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups. - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support. - "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory. - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memry. - "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios" * tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits) zram: free secondary algorithms names uprobes: turn xol_area->pages[2] into xol_area->page uprobes: introduce the global struct vm_special_mapping xol_mapping Revert "uprobes: use vm_special_mapping close() functionality" mm: support large folios swap-in for sync io devices mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios mm: fix swap_read_folio_zeromap() for large folios with partial zeromap mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries set_memory: add __must_check to generic stubs mm/vma: return the exact errno in vms_gather_munmap_vmas() memcg: cleanup with !CONFIG_MEMCG_V1 mm/show_mem.c: report alloc tags in human readable units mm: support poison recovery from copy_present_page() mm: support poison recovery from do_cow_fault() resource, kunit: add test case for region_intersects() resource: make alloc_free_mem_region() works for iomem_resource mm: z3fold: deprecate CONFIG_Z3FOLD vfio/pci: implement huge_fault support mm/arm64: support large pfn mappings mm/x86: support large pfn mappings ...
2024-09-10powerpc: Switch back to struct platform_driver::remove()Uwe Kleine-König
After commit 0edb555a65d1 ("platform: Make platform_driver::remove() return void") .remove() is (again) the right callback to implement for platform drivers. Convert all pwm drivers to use .remove(), with the eventual goal to drop struct platform_driver::remove_new(). As .remove() and .remove_new() have the same prototypes, conversion is done by just changing the structure member name in the driver initializer. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240909130902.851274-2-u.kleine-koenig@baylibre.com
2024-09-10powerpc/pseries/eeh: Fix pseries_eeh_err_injectNarayana Murty N
VFIO_EEH_PE_INJECT_ERR ioctl is currently failing on pseries due to missing implementation of err_inject eeh_ops for pseries. This patch implements pseries_eeh_err_inject in eeh_ops/pseries eeh_ops. Implements support for injecting MMIO load/store error for testing from user space. The check on PCI error type (bus type) code is moved to platform code, since the eeh_pe_inject_err can be allowed to more error types depending on platform requirement. Removal of the check for 'type' in eeh_pe_inject_err() doesn't impact PowerNV as pnv_eeh_err_inject() already has an equivalent check in place. Signed-off-by: Narayana Murty N <nnmlinux@linux.ibm.com> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240909140220.529333-1-nnmlinux@linux.ibm.com
2024-09-05powerpc: Stop using no_llseekMichael Ellerman
Since commit 868941b14441 ("fs: remove no_llseek"), no_llseek() is simply defined to be NULL, and a NULL llseek means seeking is unsupported. So for statically defined file_operations, such as all these, there's no need or benefit to set llseek = no_llseek. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240903111951.141376-1-mpe@ellerman.id.au
2024-09-05powerpc: pseries: Constify struct kobj_typeHuang Xiaojia
'struct kobj_type' is not modified. It is only used in kobject_init() which takes a 'const struct kobj_type *ktype' parameter. Constifying this structure moves some data to a read-only section, so increase over all security. On a x86_64, compiled with ppc64 defconfig: Before: ====== text data bss dec hex filename 1885 368 16 2269 8dd arch/powerpc/platforms/pseries/vas-sysfs.o After: ====== text data bss dec hex filename 1981 272 16 2269 8dd arch/powerpc/platforms/pseries/vas-sysfs.o Signed-off-by: Huang Xiaojia <huangxiaojia2@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240826150957.3500237-3-huangxiaojia2@huawei.com
2024-09-01mm: kvmalloc: align kvrealloc() with krealloc()Danilo Krummrich
Besides the obvious (and desired) difference between krealloc() and kvrealloc(), there is some inconsistency in their function signatures and behavior: - krealloc() frees the memory when the requested size is zero, whereas kvrealloc() simply returns a pointer to the existing allocation. - krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas kvrealloc() does not accept a NULL pointer at all and, if passed, would fault instead. - krealloc() is self-contained, whereas kvrealloc() relies on the caller to provide the size of the previous allocation. Inconsistent behavior throughout allocation APIs is error prone, hence make kvrealloc() behave like krealloc(), which seems superior in all mentioned aspects. Besides that, implementing kvrealloc() by making use of krealloc() and vrealloc() provides oppertunities to grow (and shrink) allocations more efficiently. For instance, vrealloc() can be optimized to allocate and map additional pages to grow the allocation or unmap and free unused pages to shrink the allocation. [dakr@kernel.org: document concurrency restrictions] Link: https://lkml.kernel.org/r/20240725125442.4957-1-dakr@kernel.org [dakr@kernel.org: disable KASAN when switching to vmalloc] Link: https://lkml.kernel.org/r/20240730185049.6244-2-dakr@kernel.org [dakr@kernel.org: properly document __GFP_ZERO behavior] Link: https://lkml.kernel.org/r/20240730185049.6244-5-dakr@kernel.org Link: https://lkml.kernel.org/r/20240722163111.4766-3-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Christian König <christian.koenig@amd.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <kees@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-30powerpc/pseries/dlpar: Add device tree nodes for DLPAR IO addHaren Myneni
In the powerpc-pseries specific implementation, the IO hotplug event is handled in the user space (drmgr tool). For the DLPAR IO ADD, the corresponding device tree nodes and properties will be added to the device tree after the device enable. The user space (drmgr tool) uses configure_connector RTAS call with the DRC index to retrieve the device nodes and updates the device tree by writing to /proc/ppc64/ofdt. Under system lockdown, /dev/mem access to allocate buffers for configure_connector RTAS call is restricted which means the user space can not issue this RTAS call and also can not access to /proc/ppc64/ofdt. The pseries implementation need user interaction to power-on and add device to the slot during the ADD event handling. So adds complexity if the complete hotplug ADD event handling moved to the kernel. To overcome /dev/mem access restriction, this patch extends the /sys/kernel/dlpar interface and provides ‘dt add index <drc_index>’ to the user space. The drmgr tool uses this interface to update the device tree whenever the device is added. This interface retrieves device tree nodes for the corresponding DRC index using the configure_connector RTAS call and adds new device nodes / properties to the device tree. Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com> Signed-off-by: Haren Myneni <haren@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240822025028.938332-3-haren@linux.ibm.com
2024-08-30powerpc/pseries/dlpar: Remove device tree node for DLPAR IO removeHaren Myneni
In the powerpc-pseries specific implementation, the IO hotplug event is handled in the user space (drmgr tool). But update the device tree and /dev/mem access to allocate buffers for some RTAS calls are restricted when the kernel lockdown feature is enabled. For the DLPAR IO REMOVE, the corresponding device tree nodes and properties have to be removed from the device tree after the device disable. The user space removes the device tree nodes by updating /proc/ppc64/ofdt which is not allowed under system lockdown is enabled. This restriction can be resolved by moving the complete IO hotplug handling in the kernel. But the pseries implementation need user interaction to power off and to remove device from the slot during hotplug event handling. To overcome the /proc/ppc64/ofdt restriction, this patch extends the /sys/kernel/dlpar interface and provides ‘dt remove index <drc_index>’ to the user space so that drmgr tool can remove the corresponding device tree nodes based on DRC index from the device tree. Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com> Signed-off-by: Haren Myneni <haren@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240822025028.938332-2-haren@linux.ibm.com
2024-08-30powerpc/pseries: Use correct data types from pseries_hp_errorlog structHaren Myneni
_be32 type is defined for some elements in pseries_hp_errorlog struct but also used them u32 after be32_to_cpu() conversion. Example: In handle_dlpar_errorlog() hp_elog->_drc_u.drc_index = be32_to_cpu(hp_elog->_drc_u.drc_index); And later assigned to u32 type dlpar_cpu() - u32 drc_index = hp_elog->_drc_u.drc_index; This incorrect usage is giving the following warnings and the patch resolve these warnings with the correct assignment. arch/powerpc/platforms/pseries/dlpar.c:398:53: sparse: sparse: incorrect type in argument 1 (different base types) @@ expected unsigned int [usertype] drc_index @@ got restricted __be32 [usertype] drc_index @@ ... arch/powerpc/platforms/pseries/dlpar.c:418:43: sparse: sparse: incorrect type in assignment (different base types) @@ expected restricted __be32 [usertype] drc_count @@ got unsigned int [usertype] @@ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202408182142.wuIKqYae-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202408182302.o7QRO45S-lkp@intel.com/ Signed-off-by: Haren Myneni <haren@linux.ibm.com> v3: - Fix warnings from using incorrect data types in pseries_hp_errorlog struct v2: - Remove pr_info() and TODO comments - Update more information in the commit logs Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240822025028.938332-1-haren@linux.ibm.com
2024-08-29powerpc/pseries/dlpar: Use helper function for_each_child_of_node()Zhang Zekun
for_each_child_of_node can help to iterate through the device_node, and we don't need to use while loop. No functional change with this conversion. Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240822085430.25753-3-zhangzekun11@huawei.com
2024-07-25Merge tag 'driver-core-6.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the big set of driver core changes for 6.11-rc1. Lots of stuff in here, with not a huge diffstat, but apis are evolving which required lots of files to be touched. Highlights of the changes in here are: - platform remove callback api final fixups (Uwe took many releases to get here, finally!) - Rust bindings for basic firmware apis and initial driver-core interactions. It's not all that useful for a "write a whole driver in rust" type of thing, but the firmware bindings do help out the phy rust drivers, and the driver core bindings give a solid base on which others can start their work. There is still a long way to go here before we have a multitude of rust drivers being added, but it's a great first step. - driver core const api changes. This reached across all bus types, and there are some fix-ups for some not-common bus types that linux-next and 0-day testing shook out. This work is being done to help make the rust bindings more safe, as well as the C code, moving toward the end-goal of allowing us to put driver structures into read-only memory. We aren't there yet, but are getting closer. - minor devres cleanups and fixes found by code inspection - arch_topology minor changes - other minor driver core cleanups All of these have been in linux-next for a very long time with no reported problems" * tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (55 commits) ARM: sa1100: make match function take a const pointer sysfs/cpu: Make crash_hotplug attribute world-readable dio: Have dio_bus_match() callback take a const * zorro: make match function take a const pointer driver core: module: make module_[add|remove]_driver take a const * driver core: make driver_find_device() take a const * driver core: make driver_[create|remove]_file take a const * firmware_loader: fix soundness issue in `request_internal` firmware_loader: annotate doctests as `no_run` devres: Correct code style for functions that return a pointer type devres: Initialize an uninitialized struct member devres: Fix memory leakage caused by driver API devm_free_percpu() devres: Fix devm_krealloc() wasting memory driver core: platform: Switch to use kmemdup_array() driver core: have match() callback in struct bus_type take a const * MAINTAINERS: add Rust device abstractions to DRIVER CORE device: rust: improve safety comments MAINTAINERS: add Danilo as FIRMWARE LOADER maintainer MAINTAINERS: add Rust FW abstractions to FIRMWARE LOADER firmware: rust: improve safety comments ...
2024-07-19Merge tag 'powerpc-6.11-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Remove support for 40x CPUs & platforms - Add support to the 64-bit BPF JIT for cpu v4 instructions - Fix PCI hotplug driver crash on powernv - Fix doorbell emulation for KVM on PAPR guests (nestedv2) - Fix KVM nested guest handling of some less used SPRs - Online NUMA nodes with no CPU/memory if they have a PCI device attached - Reduce memory overhead of enabling kfence on 64-bit Radix MMU kernels - Reimplement the iommu table_group_ops for pseries for VFIO SPAPR TCE Thanks to: Anjali K, Artem Savkov, Athira Rajeev, Breno Leitao, Brian King, Celeste Liu, Christophe Leroy, Esben Haabendal, Gaurav Batra, Gautam Menghani, Haren Myneni, Hari Bathini, Jeff Johnson, Krishna Kumar, Krzysztof Kozlowski, Nathan Lynch, Nicholas Piggin, Nick Bowler, Nilay Shroff, Rob Herring (Arm), Shawn Anastasio, Shivaprasad G Bhat, Sourabh Jain, Srikar Dronamraju, Timothy Pearson, Uwe Kleine-König, and Vaibhav Jain. * tag 'powerpc-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (57 commits) Documentation/powerpc: Mention 40x is removed powerpc: Remove 40x leftovers macintosh/therm_windtunnel: fix module unload. powerpc: Check only single values are passed to CPU/MMU feature checks powerpc/xmon: Fix disassembly CPU feature checks powerpc: Drop clang workaround for builtin constant checks powerpc64/bpf: jit support for signed division and modulo powerpc64/bpf: jit support for sign extended mov powerpc64/bpf: jit support for sign extended load powerpc64/bpf: jit support for unconditional byte swap powerpc64/bpf: jit support for 32bit offset jmp instruction powerpc/pci: Hotplug driver bridge support pci/hotplug/pnv_php: Fix hotplug driver crash on Powernv powerpc/configs: Update defconfig with now user-visible CONFIG_FSL_IFC powerpc: add missing MODULE_DESCRIPTION() macros macintosh/mac_hid: add MODULE_DESCRIPTION() KVM: PPC: add missing MODULE_DESCRIPTION() macros powerpc/kexec: Use of_property_read_reg() powerpc/64s/radix/kfence: map __kfence_pool at page granularity powerpc/pseries/iommu: Define spapr_tce_table_group_ops only with CONFIG_IOMMU_API ...
2024-07-04powerpc: add missing MODULE_DESCRIPTION() macrosJeff Johnson
With ARCH=powerpc, make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in arch/powerpc/kernel/rtas_flash.o WARNING: modpost: missing MODULE_DESCRIPTION() in arch/powerpc/sysdev/rtc_cmos_setup.o WARNING: modpost: missing MODULE_DESCRIPTION() in arch/powerpc/platforms/pseries/papr_scm.o WARNING: modpost: missing MODULE_DESCRIPTION() in arch/powerpc/platforms/cell/spufs/spufs.o WARNING: modpost: missing MODULE_DESCRIPTION() in arch/powerpc/platforms/cell/cbe_thermal.o WARNING: modpost: missing MODULE_DESCRIPTION() in arch/powerpc/platforms/cell/cpufreq_spudemand.o WARNING: modpost: missing MODULE_DESCRIPTION() in arch/powerpc/platforms/cell/cbe_powerbutton.o Add the missing invocation of the MODULE_DESCRIPTION() macro to all files which have a MODULE_LICENSE(). This includes 85xx/t1042rdb_diu.c and chrp/nvram.c which, although they did not produce a warning with the powerpc allmodconfig configuration, may cause this warning with other configurations. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240615-md-powerpc-arch-powerpc-v1-1-ba4956bea47a@quicinc.com
2024-07-04powerpc/pseries/iommu: Define spapr_tce_table_group_ops only with ↵Shivaprasad G Bhat
CONFIG_IOMMU_API The patch fixes the below warning: arch/powerpc/platforms/pseries/iommu.c:1824:37: warning: 'spapr_tce_table_group_ops' defined but not used Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407020357.Hz8kQkKf-lkp@intel.com/ Fixes: b09c031d9433 ("powerpc/iommu: Move pSeries specific functions to pseries/iommu.c") Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/172008854222.784.13666247605789409729.stgit@linux.ibm.com
2024-07-03driver core: have match() callback in struct bus_type take a const *Greg Kroah-Hartman
In the match() callback, the struct device_driver * should not be changed, so change the function callback to be a const *. This is one step of many towards making the driver core safe to have struct device_driver in read-only memory. Because the match() callback is in all busses, all busses are modified to handle this properly. This does entail switching some container_of() calls to container_of_const() to properly handle the constant *. For some busses, like PCI and USB and HV, the const * is cast away in the match callback as those busses do want to modify those structures at this point in time (they have a local lock in the driver structure.) That will have to be changed in the future if they wish to have their struct device * in read-only-memory. Cc: Rafael J. Wysocki <rafael@kernel.org> Reviewed-by: Alex Elder <elder@kernel.org> Acked-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lore.kernel.org/r/2024070136-wrongdoer-busily-01e8@gregkh Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-28powerpc/pseries: Fix scv instruction crash with kexecNicholas Piggin
kexec on pseries disables AIL (reloc_on_exc), required for scv instruction support, before other CPUs have been shut down. This means they can execute scv instructions after AIL is disabled, which causes an interrupt at an unexpected entry location that crashes the kernel. Change the kexec sequence to disable AIL after other CPUs have been brought down. As a refresher, the real-mode scv interrupt vector is 0x17000, and the fixed-location head code probably couldn't easily deal with implementing such high addresses so it was just decided not to support that interrupt at all. Fixes: 7fa95f9adaee ("powerpc/64s: system call support for scv/rfscv instructions") Cc: stable@vger.kernel.org # v5.9+ Reported-by: Sourabh Jain <sourabhjain@linux.ibm.com> Closes: https://lore.kernel.org/3b4b2943-49ad-4619-b195-bc416f1d1409@linux.ibm.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Sourabh Jain <sourabhjain@linux.ibm.com> Link: https://msgid.link/20240625134047.298759-1-npiggin@gmail.com Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2024-06-28powerpc/iommu: Reimplement the iommu_table_group_ops for pSeriesShivaprasad G Bhat
PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows for PEs, their ownership transfer, create/set/unset the TCE tables for the Dynamic DMA wundows(DDW). VFIOS uses these APIs for support on POWER. The commit 9d67c9433509 ("powerpc/iommu: Add "borrowing" iommu_table_group_ops") implemented partial support for this API with "borrow" mechanism wherein the DMA windows if created already by the host driver, they would be available for VFIO to use. Also, it didn't have the support to control/modify the window size or the IO page size. The current patch implements all the necessary iommu_table_group_ops APIs there by avoiding the "borrrowing". So, just the way it is on the PowerNV platform, with this patch the iommu table group ownership is transferred to the VFIO PPC subdriver, the iommu table, DMA windows creation/deletion all driven through the APIs. The pSeries uses the query-pe-dma-window, create-pe-dma-window and reset-pe-dma-window RTAS calls for DMA window creation, deletion and reset to defaul. The RTAs calls do show some minor differences to the way things are to be handled on the pSeries which are listed below. * On pSeries, the default DMA window size is "fixed" cannot be custom sized as requested by the user. For non-SRIOV VFs, It is fixed at 2GB and for SRIOV VFs, its variable sized based on the capacity assigned to it during the VF assignment to the LPAR. So, for the default DMA window alone the size if requested less than tce32_size, the smaller size is enforced using the iommu table->it_size. * The DMA start address for 32-bit window is 0, and for the 64-bit window in case of PowerNV is hardcoded to TVE select (bit 59) at 512PiB offset. This address is returned at the time of create_table() API call (even before the window is created), the subsequent set_window() call actually opens the DMA window. On pSeries, the DMA start address for 32-bit window is known from the 'ibm,dma-window' DT property. However, the 64-bit window start address is not known until the create-pe-dma RTAS call is made. So, the create_table() which returns the DMA window start address actually opens the DMA window and returns the DMA start address as returned by the Hypervisor for the create-pe-dma RTAS call. * The reset-pe-dma RTAS call resets the DMA windows and restores the default DMA window, however it does not clear the TCE table entries if there are any. In case of ownership transfer from platform domain which used direct mapping, the patch chooses remove-pe-dma instead of reset-pe for the 64-bit window intentionally so that the clear_dma_window() is called. Other than the DMA window management changes mentioned above, the patch also brings back the userspace view for the single level TCE as it existed before commit 090bad39b237a ("powerpc/powernv: Add indirect levels to it_userspace") along with the relavent refactoring. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923275958.1397.907964437142542242.stgit@linux.ibm.com
2024-06-28powerpc/pseries/iommu: Use the iommu table[0] for IOV VF's DDWShivaprasad G Bhat
This patch basically brings consistency with PowerNV approach to use the first freely available iommu table when the default window is removed. The pSeries iommu code convention has been that the table[0] is for the default 32 bit DMA window and the table[1] is for the 64 bit DDW. With VFs having only 1 DMA window, the default has to be removed for creating the larger DMA window. The existing code uses the table[1] for that, while marking the table[0] as NULL. This is fine as long as the host driver itself uses the device. For the VFIO user, on pSeries there is no way to skip table[0] as the VFIO subdriver uses the first freely available table. The window 0, when created as 64-bit DDW in that context would still be on table[0], as the maximum number of windows is 1. This won't have any impact for the host driver as the table is fetched from the device's iommu_table_base. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> [mpe: Rebase and resolve conflicts] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923272328.1397.1817843961216868850.stgit@linux.ibm.com
2024-06-28powerpc/pseries/iommu: Fix the VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctl outputShivaprasad G Bhat
The ioctl VFIO_IOMMU_SPAPR_TCE_GET_INFO is not reporting the actuals on the platform as not all the details are correctly collected during the platform probe/scan into the iommu_table_group. Collect the information during the device setup time as the DMA window property is already looked up on parent nodes anyway. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923271138.1397.7908302630061814623.stgit@linux.ibm.com
2024-06-28powerpc/iommu: Move pSeries specific functions to pseries/iommu.cShivaprasad G Bhat
The PowerNV specific table_group_ops are defined in powernv/pci-ioda.c. The pSeries specific table_group_ops are sitting in the generic powerpc file. Move it to where it actually belong(pseries/iommu.c). The functions are currently defined even for CONFIG_PPC_POWERNV which are unused on PowerNV. Only code movement, no functional changes intended. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923269701.1397.15758640002786937132.stgit@linux.ibm.com
2024-06-23powerpc/pseries: Whitelist dtl slub object for copying to userspaceAnjali K
Reading the dispatch trace log from /sys/kernel/debug/powerpc/dtl/cpu-* results in a BUG() when the config CONFIG_HARDENED_USERCOPY is enabled as shown below. kernel BUG at mm/usercopy.c:102! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: xfs libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth pseries_wdt dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse CPU: 27 PID: 1815 Comm: python3 Not tainted 6.10.0-rc3 #85 Hardware name: IBM,9040-MRX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_042) hv:phyp pSeries NIP: c0000000005d23d4 LR: c0000000005d23d0 CTR: 00000000006ee6f8 REGS: c000000120c078c0 TRAP: 0700 Not tainted (6.10.0-rc3) MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 2828220f XER: 0000000e CFAR: c0000000001fdc80 IRQMASK: 0 [ ... GPRs omitted ... ] NIP [c0000000005d23d4] usercopy_abort+0x78/0xb0 LR [c0000000005d23d0] usercopy_abort+0x74/0xb0 Call Trace: usercopy_abort+0x74/0xb0 (unreliable) __check_heap_object+0xf8/0x120 check_heap_object+0x218/0x240 __check_object_size+0x84/0x1a4 dtl_file_read+0x17c/0x2c4 full_proxy_read+0x8c/0x110 vfs_read+0xdc/0x3a0 ksys_read+0x84/0x144 system_call_exception+0x124/0x330 system_call_vectored_common+0x15c/0x2ec --- interrupt: 3000 at 0x7fff81f3ab34 Commit 6d07d1cd300f ("usercopy: Restrict non-usercopy caches to size 0") requires that only whitelisted areas in slab/slub objects can be copied to userspace when usercopy hardening is enabled using CONFIG_HARDENED_USERCOPY. Dtl contains hypervisor dispatch events which are expected to be read by privileged users. Hence mark this safe for user access. Specify useroffset=0 and usersize=DISPATCH_LOG_BYTES to whitelist the entire object. Co-developed-by: Vishal Chourasia <vishalc@linux.ibm.com> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com> Signed-off-by: Anjali K <anjalik@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240614173844.746818-1-anjalik@linux.ibm.com
2024-06-04powerpc/pseries/vas: Use usleep_range() to support HCALL delayHaren Myneni
VAS allocate, modify and deallocate HCALLs returns H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC for busy delay and expects OS to reissue HCALL after that delay. But using msleep() will often sleep at least 20 msecs even though the hypervisor suggests OS reissue these HCALLs after 1 or 10msecs. The open and close VAS window functions hold mutex and then issue these HCALLs. So these operations can take longer than the necessary when multiple threads issue open or close window APIs simultaneously, especially might affect the performance in the case of repeat open/close APIs for each compression request. Multiple tasks can open / close VAS windows at the same time which depends on the available VAS credits. For example, 240 cores system provides 4800 VAS credits. It means 4800 tasks can execute open VAS windows HCALLs with the mutex. Since each msleep() will often sleep more than 20 msecs, some tasks are waiting more than 120 secs to acquire mutex. It can cause hung traces for these tasks in dmesg due to mutex contention around open/close HCALLs. Instead of msleep(), use usleep_range() to ensure sleep with the expected value before issuing HCALL again. So since each task sleep 10 msecs maximum, this patch allow more tasks can issue open/close VAS calls without any hung traces in the dmesg. Signed-off-by: Haren Myneni <haren@linux.ibm.com> Suggested-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240116055910.421605-1-haren@linux.ibm.com
2024-06-04powerpc/numa: Online a node if PHB is attached.Nilay Shroff
In the current design, a numa-node is made online only if that node is attached to cpu/memory. With this design, if any PCI/IO device is found to be attached to a numa-node which is not online then the numa-node id of the corresponding PCI/IO device is set to NUMA_NO_NODE(-1). This design may negatively impact the performance of PCIe device if the numa-node assigned to PCIe device is -1 because in such case we may not be able to accurately calculate the distance between two nodes. The multi-controller NVMe PCIe disk has an issue with calculating the node distance if the PCIe NVMe controller is attached to a PCI host bridge which has numa-node id value set to NUMA_NO_NODE. This patch helps fix this ensuring that a cpu/memory less numa node is made online if it's attached to PCI host bridge. Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240517142531.3273464-3-nilay@linux.ibm.com
2024-06-04powerpc/pseries/iommu: Split Dynamic DMA Window to be used in Hybrid modeGaurav Batra
Dynamic DMA Window (DDW) supports TCEs that are backed by 2MB page size. In most configurations, DDW is big enough to pre-map all of LPAR memory for IO. Pre-mapping of memory for DMA results in improvements in IO performance. Persistent memory, vPMEM, can be assigned to an LPAR as well. vPMEM is not contiguous with LPAR memory and usually is assigned at high memory addresses. This makes is not possible to pre-map both vPMEM and LPAR memory in the same DDW. For a dedicated adapter this limitation is not an issue. Dedicated adapters can have both Default DMA window, which is backed by 4K page size and a DDW backed by 2MB page size TCEs. In this scenario, LPAR memory is pre-mapped in the DDW. Any DMA going to the vPMEM is routed via dynamically allocated TCEs in the default window. The issue arises with SR-IOV adapters. There is only one DMA window - either Default or DDW. If an LPAR has vPMEM assigned, memory is not pre-mapped in the DDW since TCEs needs to be allocated for vPMEM as well. In this case, DDW is created and TCEs are dynamically allocated for both vPMEM and LPAR memory. Today, DDW is only used in single mode - direct mapped TCEs or dynamically mapped TCEs. This enhancement breaks a single DDW in 2 regions - 1. First region to pre-map LPAR memory 2. Second region to dynamically allocate TCEs for IO to vPMEM The DDW is split only if it is big enough to pre-map complete LPAR memory and still have some space left to dynamically map vPMEM. Maximum size possible DDW is created as permitted by the Hypervisor. Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> Reviewed-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240514014608.35537-1-gbatra@linux.ibm.com
2024-05-30powerpc/pseries/lparcfg: drop error message from guest name lookupNathan Lynch
It's not an error or exceptional situation when the hosting environment does not expose a name for the LP/guest via RTAS or the device tree. This happens with qemu when run without the '-name' option. The message also lacks a newline. Remove it. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Fixes: eddaa9a40275 ("powerpc/pseries: read the lpar name from the firmware") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240524-lparcfg-updates-v2-1-62e2e9d28724@linux.ibm.com
2024-05-19Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...