Merge tag 'arm-fixes-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull arm SoC fixes from Arnd Bergmann:
"There are a couple of devicetree fixes for samsung, riscv/sophgo, and
for TPM device nodes on a couple of platforms.
Both the Arm FF-A and the SCMI firmware drivers get a number of code
fixes, addressing minor implementation bugs and compatibility with
firmware implementations. Most of these bugs relate to the usage of
xarray and rwlock structures and are fixed by Cristian Marussi"
* tag 'arm-fixes-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
riscv: dts: sophgo: separate sg2042 mtime and mtimecmp to fit aclint format
arm64: dts: Fix TPM schema violations
ARM: dts: Fix TPM schema violations
ARM: dts: exynos4212-tab3: add samsung,invert-vclk flag to fimd
arm64: dts: exynos: gs101: comply with the new cmu_misc clock names
firmware: arm_ffa: Handle partitions setup failures
firmware: arm_ffa: Use xa_insert() and check for result
firmware: arm_ffa: Simplify ffa_partitions_cleanup()
firmware: arm_ffa: Check xa_load() return value
firmware: arm_ffa: Add missing rwlock_init() for the driver partition
firmware: arm_ffa: Add missing rwlock_init() in ffa_setup_partitions()
firmware: arm_scmi: Fix the clock protocol supported version
firmware: arm_scmi: Fix the clock protocol version for v3.2
firmware: arm_scmi: Use xa_insert() when saving raw queues
firmware: arm_scmi: Use xa_insert() to store opps
firmware: arm_scmi: Replace asm-generic/bug.h with linux/bug.h
firmware: arm_scmi: Check mailbox/SMT channel for consistency
|
|
Merge tag 'spi-fix-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"As well as a few device IDs and the usual scattering of driver
specific fixes this contains a couple of core things.
One is a missed case in error handling; the other patch is a change
from me raising the number of chip selects allowed by the newly added
multi chip select support patches to resolve problems seen on several
systems that exceeded the limit.
This is not a real solution to the issue but rather just a change to
avoid disruption to users; one of the options I am considering is just
sending a revert of those changes if we can't come up with something
sensible"
* tag 'spi-fix-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: fix finalize message on error return
spi: cs42l43: Handle error from devm_pm_runtime_enable
spi: Raise limit on number of chip selects
spi: hisi-sfc-v3xx: Return IRQ_NONE if no interrupts were detected
spi: spi-cadence: Reverse the order of interleaved write and read operations
spi: spi-imx: Use dev_err_probe for failed DMA channel requests
spi: bcm-qspi: fix SFDP BFPT read by using mspi read
spi: intel-pci: Add support for Arrow Lake SPI serial flash
spi: intel-pci: Remove Meteor Lake-S SoC PCI ID from the list
|
|
Merge tag 'gpio-fixes-for-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- add a quirk to GPIO ACPI handling to ignore touchpad wakeups on GPD
G1619-04
- clear interrupt status bits (that may have been set before enabling
the interrupts) after setting the interrupt type in gpio-eic-sprd
* tag 'gpio-fixes-for-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: eic-sprd: Clear interrupt after set the interrupt type
gpiolib: acpi: Ignore touchpad wakeup on GPD G1619-04
|
|
Merge tag 'media/v6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
- remove K3 DT prefix from wave5
- vb2 core: fix missing caps on VIDIOC_CREATE_BUFS under certain
circumstances
- videobuf2: Stop direct calls to queue num_buffers field
* tag 'media/v6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: vb2: refactor setting flags and caps, fix missing cap
media: media videobuf2: Stop direct calls to queue num_buffers field
media: chips-media: wave5: Remove K3 References
dt-bindings: media: Remove K3 Family Prefix from Compatible
|
|
Fix register_snapshot_trigger() to return an error code if it fails to
allocate a snapshot, instead of returning 0 (success). Without this fix,
it will register the snapshot trigger without reporting an error.
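For illustration, a minimal standalone sketch of the pattern the fix applies
(not the tracing code itself; register_trigger()/alloc_snapshot_buffer() are
hypothetical stand-ins): propagate the allocation error instead of falling
through and returning 0.

  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* hypothetical stand-in for the snapshot allocation */
  static void *alloc_snapshot_buffer(size_t size)
  {
          return malloc(size);
  }

  static int register_trigger(size_t size)
  {
          void *buf = alloc_snapshot_buffer(size);

          if (!buf)
                  return -ENOMEM; /* propagate the failure instead of returning 0 */

          /* the real code would keep buf and register the trigger;
           * freed here only to keep the sketch leak-free */
          free(buf);
          return 0;
  }

  int main(void)
  {
          printf("register_trigger: %d\n", register_trigger(4096));
          return 0;
  }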
Link: https://lore.kernel.org/linux-trace-kernel/170622977792.270660.2789298642759362200.stgit@devnote2
Fixes: 0bbe7f719985 ("tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation")
Cc: stable@vger.kernel.org
Cc: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Add touch screen info for TECLAST X16 Plus tablet.
Signed-off-by: Phoenix Chen <asbeltogf@gmail.com>
Link: https://lore.kernel.org/r/20240126095308.5042-1-asbeltogf@gmail.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
A missing release_firmware() call in the error handling path blocked any
future image loading.
Fix the return code and call release_firmware() to release the bad image.
Fixes: 25a76dbb36dd ("platform/x86/intel/ifs: Validate image size")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20240125082254.424859-2-ashok.raj@intel.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
amd_pmf_get_pb_data() will allocate memory for the policy buffer,
but does not free it if copy_from_user() fails. This leads to a memory
leak.
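A minimal userspace analogue of the leak and its fix (not the PMF driver
code; copy_in() is a hypothetical stand-in for copy_from_user(), returning
the number of bytes not copied): free the buffer when the copy step fails.

  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static size_t copy_in(void *dst, const void *src, size_t len)
  {
          if (!src)
                  return len;     /* simulate a faulting user pointer */
          memcpy(dst, src, len);
          return 0;
  }

  static int get_policy_buffer(const void *user_buf, size_t len)
  {
          void *buf = malloc(len);

          if (!buf)
                  return -ENOMEM;

          if (copy_in(buf, user_buf, len)) {
                  free(buf);      /* the missing cleanup: do not leak buf */
                  return -EFAULT;
          }

          /* hand buf off for further processing, then release it */
          free(buf);
          return 0;
  }

  int main(void)
  {
          char src[16] = "policy";

          printf("ok path:    %d\n", get_policy_buffer(src, sizeof(src)));
          printf("fault path: %d\n", get_policy_buffer(NULL, sizeof(src)));
          return 0;
  }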
Fixes: 10817f28e533 ("platform/x86/amd/pmf: Add capability to sideload of policy binary")
Reviewed-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Signed-off-by: Cong Liu <liucong2@kylinos.cn>
Link: https://lore.kernel.org/r/20240124012939.6550-1-liucong2@kylinos.cn
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
The AMD SFH driver has APIs defined to export the ambient light information;
use this within the PMF driver to send inputs to the PMF TA, so that the PMF
driver can act on the actions coming from the TA.
Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20240123141458.3715211-2-Shyam-sundar.S-k@amd.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
The AMD SFH driver has APIs defined to export the human presence information;
use this within the PMF driver to send inputs to the PMF TA, so that the PMF
driver can act on the actions coming from the TA.
Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20240123141458.3715211-1-Shyam-sundar.S-k@amd.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
Merge cpufreq fixes for 6.8-rc2:
- Fix the handling of scaling_max/min_freq sysfs attributes in the AMD
P-state cpufreq driver (Mario Limonciello).
- Make the intel_pstate cpufreq driver avoid unnecessary computation of
the HWP performance level corresponding to a given frequency in the
cases when it is known already, which also helps to avoid reducing
the maximum CPU capacity artificially on some systems (Rafael J.
Wysocki).
* pm-cpufreq:
cpufreq/amd-pstate: Fix setting scaling max/min freq values
cpufreq: intel_pstate: Refine computation of P-state for given frequency
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos into drm-fixes
One regression fixup to samsung-dsim.c module
- The FORCE_STOP_STATE bit is ineffective for forcing DSI link into LP-11 mode,
causing timing issues and potential bridge failures.
This patch reverts previous commits and corrects this issue.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Inki Dae <inki.dae@samsung.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240126141130.15512-1-inki.dae@samsung.com
|
|
This reverts commit eacabb5462717a52fccbbbba458365a4f5e61f35.
This commit causes some regressions in desktop usage. Reverting it will
reintroduce the original deadlock in DRI_PRIME situations; I've got an
idea to fix that by offloading to a workqueue in a different spot.
However, this code has a race condition where we sometimes miss
interrupts, so I'd like to fix that as well.
Cc: stable@vger.kernel.org
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
- PSR fix for HSW
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/ZbPGBL9lj4DxxIW1@jlahtine-mobl.ger.corp.intel.com
|
|
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
Plenty of ivpu fixes to improve the general stability and debugging, a
suspend fix for the anx7625 bridge, a revert to fix an initialization
order bug between i915 and simpledrm and a documentation warning fix for
dp_mst.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/tp77e5fokigup6cgmpq6mtg46kzdw2dpze6smpnwfoml4kmwpq@bo6mbkezpkle
|
|
Dave has not been very active on arch/sparc for the past two years.
I have been contributing to the SPARC32 port as well as maintaining
out-of-tree SPARC32 patches for LEON3/4/5 (SPARCv8 with CAS support)
since 2012. I am willing to step up as an arch/sparc (co-)maintainer.
For recent discussions on the matter, see [1] and [2].
[1] https://lore.kernel.org/r/20230713075235.2164609-1-u.kleine-koenig@pengutronix.de
[2] https://lore.kernel.org/r/20231209105816.GA1085691@ravnborg.org/
Signed-off-by: Andreas Larsson <andreas@gaisler.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Acked-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Acked-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
The FORCE_STOP_STATE bit is unsuitable to force the DSI link into LP-11
mode. It seems the bridge internally queues DSI packets and when the
FORCE_STOP_STATE bit is cleared, they are sent in close succession
without any useful timing (this also means that the DSI lanes won't go
into LP-11 mode). The length of this gibberish varies between 1ms and
5ms. This sometimes breaks an attached bridge (TI SN65DSI84 in this
case). In our case, the bridge will fail in about 1 per 500 reboots.
The FORCE_STOP_STATE handling was introduced to have the DSI lanes in
LP-11 state during the .pre_enable phase. But as it turns out, none of
this is needed at all. Between samsung_dsim_init() and
samsung_dsim_set_display_enable() the lanes are already in LP-11 mode.
The code as it was before commit 20c827683de0 ("drm: bridge:
samsung-dsim: Fix init during host transfer") and 0c14d3130654 ("drm:
bridge: samsung-dsim: Fix i.MX8M enable flow to meet spec") was correct
in this regard.
This patch basically reverts both commits. It was tested on an i.MX8M
SoC with an SN65DSI84 bridge. The signals were probed and the DSI
packets were decoded during initialization and link start-up. After this
patch the first DSI packet on the link is a VSYNC packet and the timing
is correct.
Command mode between .pre_enable and .enable was also briefly tested with
a quick hack. There was no DSI link partner which would have responded,
but it was verified that the DSI packet was sent on the link. As a side
note, command mode seems to just work in HS mode; I couldn't find any
indication that the bridge handles commands in LP mode.
Fixes: 20c827683de0 ("drm: bridge: samsung-dsim: Fix init during host transfer")
Fixes: 0c14d3130654 ("drm: bridge: samsung-dsim: Fix i.MX8M enable flow to meet spec")
Signed-off-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20231113164344.1612602-1-mwalle@kernel.org
|
|
Change the timer layout in the dtb to fit the format that is needed by
the SBI.
Fixes: 967a94a92aaa ("riscv: dts: add initial Sophgo SG2042 SoC device tree")
Reviewed-by: Chen Wang <unicorn_wang@outlook.com>
Reviewed-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Inochi Amaoto <inochiama@outlook.com>
Signed-off-by: Chen Wang <unicorn_wang@outlook.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
I encountered a race issue after lengthy (~594647 secs) stress tests on
a 64k-page arm64 VM with several 4k-block EROFS images. The timing
is like below:
  z_erofs_try_inplace_io              z_erofs_fill_bio_vec
    cmpxchg(&compressed_bvecs[].page,
            NULL, ..)
                                        [access bufvec]
    compressed_bvecs[] = *bvec;
Previously, z_erofs_submit_queue() only accessed bufvec->page, so the
other fields in bufvec didn't matter. After the subpage block support
landed, .offset and .end can be used too, but filling bufvec isn't an
atomic operation, which can cause inconsistency.
Let's use a spinlock to keep each bufvec consistent. More specifically,
just reuse the existing spinlock `pcl->obj.lockref.lock`, since it's
rarely used (and held only briefly even when it is) as long as the
pcluster has a reference.
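As a self-contained userspace analogue of the approach (not the erofs code,
and using a pthread spinlock rather than pcl->obj.lockref.lock): every
multi-field write and read of a slot happens under the same spinlock, so no
reader can observe a half-filled entry.

  #include <pthread.h>
  #include <stdio.h>

  struct bvec_slot {
          void *page;
          unsigned int offset;
          unsigned int end;
  };

  static pthread_spinlock_t slot_lock;
  static struct bvec_slot slot;

  /* writer: fill every field of the slot under the lock */
  static void fill_slot(void *page, unsigned int offset, unsigned int end)
  {
          pthread_spin_lock(&slot_lock);
          slot.page = page;
          slot.offset = offset;
          slot.end = end;
          pthread_spin_unlock(&slot_lock);
  }

  /* reader: take a consistent snapshot under the same lock */
  static struct bvec_slot read_slot(void)
  {
          struct bvec_slot copy;

          pthread_spin_lock(&slot_lock);
          copy = slot;
          pthread_spin_unlock(&slot_lock);
          return copy;
  }

  int main(void)
  {
          int dummy_page;

          pthread_spin_init(&slot_lock, PTHREAD_PROCESS_PRIVATE);
          fill_slot(&dummy_page, 0, 4096);
          struct bvec_slot c = read_slot();
          printf("page=%p offset=%u end=%u\n", c.page, c.offset, c.end);
          pthread_spin_destroy(&slot_lock);
          return 0;
  }

(build with -pthread)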
Fixes: 192351616a9d ("erofs: support I/O submission for sub-page compressed blocks")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Link: https://lore.kernel.org/r/20240125120039.3228103-1-hsiangkao@linux.alibaba.com
|
|
Lantiq uses a common kernel config for devices with 24Kc and 34Kc cores.
The changes made previously to add support for interrupts on all cores
work on 24Kc platforms with SMP disabled and 34Kc platforms with SMP
enabled. This patch fixes boot issues on Danube (single core 24Kc) with
SMP enabled.
Fixes: 730320fd770d ("MIPS: lantiq: enable all hardware interrupts on second VPE")
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
Commit 61167ad5fecd ("mm: pass nid to reserve_bootmem_region()") reveals
that reserved memblock regions have no valid node id set; set it properly,
since the loongson64 firmware makes the node id clear in its memory layout
info.
This works around booting failure on 3A1000+ since commit 61167ad5fecd
("mm: pass nid to reserve_bootmem_region()") under
CONFIG_DEFERRED_STRUCT_PAGE_INIT.
Signed-off-by: Huang Pei <huangpei@loongson.cn>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
"cpu_probe" is called both by BP and APs, but reserving exception vector
(like 0x0-0x1000) called by "cpu_probe" need once and calling on APs is
too late since memblock is unavailable at that time.
So, reserve exception vector ONLY by BP.
Suggested-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Huang Pei <huangpei@loongson.cn>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
Most of the symbols for which we do not have a prototype can actually be
made static, and for the few that cannot, there is already a declaration
in a header for them.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
The addition of commit efa7df3e3bb5 ("mm: align larger anonymous mappings
on THP boundaries") caused the "virtual_address_range" mm selftest to
start failing on arm64. Let's fix that regression.
There were 2 visible problems when running the test; 1) it takes much
longer to execute, and 2) the test fails. Both are related:
The (first part of the) test allocates as many 1GB anonymous blocks as it
can in the low 256TB of address space, passing NULL as the addr hint to
mmap. Before the faulty patch, all allocations were abutted and contained
in a single, merged VMA. However, after this patch, each allocation is in
its own VMA, and there is a 2M gap between each VMA. This causes the 2
problems in the test: 1) mmap becomes MUCH slower because there are so
many VMAs to check to find a new 1G gap. 2) mmap fails once it hits the
VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then causes a
subsequent calloc() to fail, which causes the test to fail.
The problem is that arm64 (unlike x86) selects
ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area()
allocates len+2M then always aligns to the bottom of the discovered gap.
That causes the 2M hole.
Fix this by detecting cases where we can still achieve the alignment goal
when moved to the top of the allocated area, if configured to prefer
top-down allocation.
While we are at it, fix thp_get_unmapped_area's use of pgoff, which should
always be zero for anonymous mappings. Prior to the faulty change, while
it was possible for user space to pass in pgoff!=0, the old
mm->get_unmapped_area() handler would not use it. thp_get_unmapped_area()
does use it, so let's explicitly zero it before calling the handler. This
should also be the correct behavior for arches that define their own
get_unmapped_area() handler.
Link: https://lkml.kernel.org/r/20240123171420.3970220-1-ryan.roberts@arm.com
Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
Closes: https://lore.kernel.org/linux-mm/1e8f5ac7-54ce-433a-ae53-81522b2320e1@arm.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
ahash_alg->setkey is updated to ahash_nosetkey in ahash.c, so checking the
setkey() function to determine whether an algorithm is HMAC is no longer
valid. To fix this, add an is_hmac variable to struct caam_hash_alg to
indicate whether the algorithm is HMAC or not.
Fixes: 2f1f34c1bf7b ("crypto: ahash - optimize performance when wrapping shash")
Signed-off-by: Gaurav Jain <gaurav.jain@nxp.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
The commit "crypto: qat - generate dynamically arbiter mappings"
introduced a regression on qat_402xx devices.
This is reported when the driver probes the device, as indicated by
the following error messages:
4xxx 0000:0b:00.0: enabling device (0140 -> 0142)
4xxx 0000:0b:00.0: Generate of the thread to arbiter map failed
4xxx 0000:0b:00.0: Direct firmware load for qat_402xx_mmp.bin failed with error -2
The root cause of this issue was the omission of a necessary function
pointer required by the mapping algorithm during the implementation.
Fix it by adding the missing function pointer.
Fixes: 5da6a2d5353e ("crypto: qat - generate dynamically arbiter mappings")
Signed-off-by: Damian Muszynski <damian.muszynski@intel.com>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
The stubs for kvm_own_lsx()/kvm_own_lasx() when CONFIG_CPU_HAS_LSX or
CONFIG_CPU_HAS_LASX is not defined should have a return value since they
return an int, so add "return -EINVAL;" to the stubs.
Fixes the build error:
In file included from ../arch/loongarch/include/asm/kvm_csr.h:12,
from ../arch/loongarch/kvm/interrupt.c:8:
../arch/loongarch/include/asm/kvm_vcpu.h: In function 'kvm_own_lasx':
../arch/loongarch/include/asm/kvm_vcpu.h:73:39: error: no return statement in function returning non-void [-Werror=return-type]
73 | static inline int kvm_own_lasx(struct kvm_vcpu *vcpu) { }
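The fixed stubs then look like this (a minimal compilable sketch with an
opaque vcpu type; the real declarations live in
arch/loongarch/include/asm/kvm_vcpu.h):

  #include <errno.h>

  struct kvm_vcpu;        /* opaque here; defined elsewhere in the kernel */

  /* stubs used when CONFIG_CPU_HAS_LSX / CONFIG_CPU_HAS_LASX are not set:
   * they are declared to return int, so return an error instead of nothing */
  static inline int kvm_own_lsx(struct kvm_vcpu *vcpu)  { return -EINVAL; }
  static inline int kvm_own_lasx(struct kvm_vcpu *vcpu) { return -EINVAL; }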
Fixes: db1ecca22edf ("LoongArch: KVM: Add LSX (128bit SIMD) support")
Fixes: 118e10cd893d ("LoongArch: KVM: Add LASX (256bit SIMD) support")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Commit 8569992d64b8f750e34b7858eac ("KVM: Use gfn instead of hva for
mmu_notifier_retry") replaces mmu_invalidate_retry_hva() usage with
mmu_invalidate_retry_gfn() for x86; LoongArch also needs similar changes
to fix the build.
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Machines which have more than 8 nodes fail to boot SMP after commit
a2ccf46333d7b2cf96 ("LoongArch/smp: Call rcutree_report_cpu_starting()
earlier"). Because such machines use tlb-based per-cpu base address
rather than dmw-based per-cpu base address, resulting per-cpu variables
can only be accessed after tlb_init(). But rcutree_report_cpu_starting()
is now called before tlb_init() and accesses per-cpu variables indeed.
Since the original patch want to avoid the lockdep warning caused by
page allocation in tlb_init(), we can move rcutree_report_cpu_starting()
to tlb_init() where after tlb exception configuration but before page
allocation.
Fixes: a2ccf46333d7b2cf96 ("LoongArch/smp: Call rcutree_report_cpu_starting() earlier")
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries") caused two issues [1] [2] reported on 32-bit systems or in
compat userspace.
It doesn't make much sense to force huge page alignment on 32-bit systems
due to the constrained virtual address space.
[1] https://lore.kernel.org/linux-mm/d0a136a0-4a31-46bc-adf4-2db109a61672@kernel.org/
[2] https://lore.kernel.org/linux-mm/CAJuCfpHXLdQy1a2B6xN2d7quTYwg2OoZseYPZTRpU0eHHKD-sQ@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240118180505.2914778-1-shy828301@gmail.com
Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Jiri Slaby <jirislaby@kernel.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Christopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In mfill_atomic_hugetlb(), mmap_changing isn't being checked
again if we drop mmap_lock and reacquire it. When the lock is not held,
mmap_changing could have been incremented. This is also inconsistent
with the behavior in mfill_atomic().
Link: https://lkml.kernel.org/r/20240117223729.1444522-1-lokeshgidra@google.com
Fixes: df2cc96e77011 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races")
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
ksm_tests was previously mmapping a region of memory, aligning the
returned pointer to a PMD boundary, then setting MADV_HUGEPAGE, but was
setting it past the end of the mmapped area due to not taking the pointer
alignment into consideration. Fix this behaviour.
Up until commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries"), this buggy behavior was (usually) masked because the
alignment difference was always less than PMD-size. But since the
mentioned commit, `ksm_tests -H -s 100` started failing.
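A minimal standalone sketch of the corrected pattern (not the selftest
itself; the 2M PMD size is an assumption for x86-64/arm64 with 4k pages):
over-allocate by one PMD, round the pointer up, and apply MADV_HUGEPAGE to
the aligned pointer with the originally requested length so the advice never
extends past the mapping.

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>

  #define PMD_SIZE (2UL * 1024 * 1024)    /* assumed PMD size */

  int main(void)
  {
          size_t len = 100UL * 1024 * 1024;       /* size of the area under test */

          /* over-allocate by one PMD so the aligned region still fits inside */
          char *map = mmap(NULL, len + PMD_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (map == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }

          /* round the start up to a PMD boundary */
          char *aligned = (char *)(((uintptr_t)map + PMD_SIZE - 1) & ~(PMD_SIZE - 1));

          /* advise only [aligned, aligned + len), which stays within the mapping */
          if (madvise(aligned, len, MADV_HUGEPAGE))
                  perror("madvise");

          printf("map=%p aligned=%p\n", (void *)map, (void *)aligned);
          munmap(map, len + PMD_SIZE);
          return 0;
  }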
Link: https://lkml.kernel.org/r/20240122120554.3108022-1-ryan.roberts@arm.com
Fixes: 325254899684 ("selftests: vm: add KSM huge pages merging time test")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The shadow call stack implementation fails to build without CONFIG_MMU:
ld.lld: error: undefined symbol: vfree_atomic
>>> referenced by scs.c
>>> kernel/scs.o:(scs_free) in archive vmlinux.a
Link: https://lkml.kernel.org/r/20240122175204.2371009-1-samuel.holland@sifive.com
Fixes: a2abe7cbd8fe ("scs: switch to vmapped shadow stacks")
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The correct folio replacement for "set_page_dirty()" is
"folio_mark_dirty()", not "folio_set_dirty()". Using the latter won't
properly inform the FS using the dirty_folio() callback.
This has been found by code inspection, but likely this can result in some
real trouble when zapping dirty PTEs that point at clean pagecache folios.
Yuezhang Mo said: "Without this fix, testing the latest exfat with
xfstests, test cases generic/029 and generic/030 will fail."
Link: https://lkml.kernel.org/r/20240122171751.272074-1-david@redhat.com
Fixes: c46265030b0f ("mm/memory: page_remove_rmap() -> folio_remove_rmap_pte()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Closes: https://lkml.kernel.org/r/2445cedb-61fb-422c-8bfb-caf0a2beed62@arm.com
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The correct folio replacement for "set_page_dirty()" is
"folio_mark_dirty()", not "folio_set_dirty()". Using the latter won't
properly inform the FS using the dirty_folio() callback.
This has been found by code inspection, but likely this can result in some
real trouble.
Link: https://lkml.kernel.org/r/20240122175407.307992-1-david@redhat.com
Fixes: a8e61d584eda0 ("mm/huge_memory: page_remove_rmap() -> folio_remove_rmap_pmd()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In order for the page table level 5 to be in use, the CPU must have the
setting enabled in addition to the CONFIG option. Check for the flag to be
set to avoid false test failures on systems that do not have this cpu flag
set.
The test does a series of mmap calls including three using the
MAP_FIXED flag and specifying an address that is 1<<47 or 1<<48. These
addresses are only available if you are using level 5 page tables,
which requires both the CPU to have the capability (la57 flag) and the
kernel to be configured. Currently the test only checks for the kernel
configuration option, so this test can still report a false positive.
Here are the three failing lines:
$ ./va_high_addr_switch | grep FAILED
mmap(ADDR_SWITCH_HINT, 2 * PAGE_SIZE, MAP_FIXED): 0xffffffffffffffff - FAILED
mmap(HIGH_ADDR, MAP_FIXED): 0xffffffffffffffff - FAILED
mmap(ADDR_SWITCH_HINT, 2 * PAGE_SIZE, MAP_FIXED): 0xffffffffffffffff - FAILED
I thought (for about a second) about refactoring the test so that these three
mmap calls will only be run on systems with the level 5 page tables
available, but the whole point of the test is to check the level 5
feature...
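One way to probe the CPU capability from userspace (a sketch, assuming x86
and that LA57 is reported in CPUID leaf 7, sub-leaf 0, ECX bit 16):

  #include <cpuid.h>
  #include <stdio.h>

  /* returns 1 if the CPU advertises 5-level paging support (la57) */
  static int cpu_has_la57(void)
  {
          unsigned int eax, ebx, ecx, edx;

          if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                  return 0;
          return !!(ecx & (1u << 16));    /* CPUID.(EAX=7,ECX=0):ECX.LA57 */
  }

  int main(void)
  {
          if (!cpu_has_la57()) {
                  printf("la57 not supported, skipping high-address cases\n");
                  return 0;
          }
          printf("la57 supported, running full test\n");
          return 0;
  }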
Link: https://lkml.kernel.org/r/20240119205801.62769-1-audra@redhat.com
Fixes: 4f2930c6718a ("selftests/vm: only run 128TBswitch with 5-level paging")
Signed-off-by: Audra Mitchell <audra@redhat.com>
Cc: Rafael Aquini <raquini@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Adam Sindelar <adam@wowsignal.io>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On systems with 64k page size and 512M huge page sizes, the allocation and
test succeed but error out at the munmap. As the comment states, munmap
will fail if it is not HUGEPAGE aligned. This is due to the length of the
mapping being 1/2 the size of the hugepage, causing the munmap to not be
hugepage aligned. Fix this by making the mapping length the full hugepage
if the hugepage is larger than the length of the mapping.
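The arithmetic of the fix, as a small standalone sketch (hypothetical sizes;
the real test reads the hugepage size from the system): never map, and later
munmap, less than one full huge page.

  #include <stdio.h>

  int main(void)
  {
          unsigned long hugepage_size = 512UL << 20;        /* 512M huge pages */
          unsigned long default_len   = hugepage_size / 2;  /* old mapping length */

          /* fix: use the full huge page size when the requested length is smaller */
          unsigned long len = default_len < hugepage_size ? hugepage_size
                                                          : default_len;

          printf("munmap length: %lu MiB (multiple of huge page: %s)\n",
                 len >> 20, (len % hugepage_size == 0) ? "yes" : "no");
          return 0;
  }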
Link: https://lkml.kernel.org/r/20240119131429.172448-1-npache@redhat.com
Signed-off-by: Nico Pache <npache@redhat.com>
Cc: Donet Tom <donettom@linux.vnet.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
As discussed on the mailing list [1], merge the zpool maintainers entry
into the zswap one. Also, add CREDITS entries for previous zswap/zpool
maintainers.
[1] https://lore.kernel.org/linux-mm/CAJD7tkYx4YWhGoVwnSeGc8dY_1aRRxxg8PzWBV==A6iqG_OgFw@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240117182152.1439822-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Acked-by: Seth Jennings <sjenning@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
With the introduction of the pool_rwlock (reader-writer lock), several
fast paths end up taking the pool_rwlock as readers. Furthermore,
stack_depot_put() unconditionally takes the pool_rwlock as a writer.
Despite allowing readers to make forward-progress concurrently,
reader-writer locks have inherent cache contention issues, which does not
scale well on systems with large CPU counts.
Rework the synchronization story of stack depot to again avoid taking any
locks in the fast paths. This is done by relying on RCU-protected list
traversal, and the NMI-safe subset of RCU to delay reuse of freed stack
records. See code comments for more details.
Along with the performance issues, this also fixes incorrect nesting of
rwlock within a raw_spinlock, given that stack depot should still be
usable from anywhere:
| [ BUG: Invalid wait context ]
| -----------------------------
| swapper/0/1 is trying to lock:
| ffffffff89869be8 (pool_rwlock){..--}-{3:3}, at: stack_depot_save_flags
| other info that might help us debug this:
| context-{5:5}
| 2 locks held by swapper/0/1:
| #0: ffffffff89632440 (rcu_read_lock){....}-{1:3}, at: __queue_work
| #1: ffff888100092018 (&pool->lock){-.-.}-{2:2}, at: __queue_work <-- raw_spin_lock
Stack depot usage stats are similar to the previous version after a KASAN
kernel boot:
$ cat /sys/kernel/debug/stackdepot/stats
pools: 838
allocations: 29865
frees: 6604
in_use: 23261
freelist_size: 1879
The number of pools is the same as previously. The freelist size is
minimally larger, but this may also be due to variance across system
boots. This shows that even though we do not eagerly wait for the next
RCU grace period (such as with synchronize_rcu() or call_rcu()) after
freeing a stack record - requiring depot_pop_free() to "poll" if an entry
may be used - new allocations are very likely to happen in later RCU grace
periods.
Link: https://lkml.kernel.org/r/20240118110216.2539519-2-elver@google.com
Fixes: 108be8def46e ("lib/stackdepot: allow users to evict stack traces")
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add a few basic stats counters for stack depot that can be used to derive
if stack depot is working as intended. This is a snapshot of the new
stats after booting a system with a KASAN-enabled kernel:
$ cat /sys/kernel/debug/stackdepot/stats
pools: 838
allocations: 29861
frees: 6561
in_use: 23300
freelist_size: 1840
Generally, "pools" should be well below the max; once the system is
booted, "in_use" should remain relatively steady.
Link: https://lkml.kernel.org/r/20240118110216.2539519-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Alexander Potapenko writes in [1]: "For every memory access in the code
instrumented by KMSAN we call kmsan_get_metadata() to obtain the metadata
for the memory being accessed. For virtual memory the metadata pointers
are stored in the corresponding `struct page`, therefore we need to call
virt_to_page() to get them.
According to the comment in arch/x86/include/asm/page.h,
virt_to_page(kaddr) returns a valid pointer iff virt_addr_valid(kaddr) is
true, so KMSAN needs to call virt_addr_valid() as well.
To avoid recursion, kmsan_get_metadata() must not call instrumented code,
therefore ./arch/x86/include/asm/kmsan.h forks parts of
arch/x86/mm/physaddr.c to check whether a virtual address is valid or not.
But the introduction of rcu_read_lock() to pfn_valid() added instrumented
RCU API calls to virt_to_page_or_null(), which is called by
kmsan_get_metadata(), so there is an infinite recursion now. I do not
think it is correct to stop that recursion by doing
kmsan_enter_runtime()/kmsan_exit_runtime() in kmsan_get_metadata(): that
would prevent instrumented functions called from within the runtime from
tracking the shadow values, which might introduce false positives."
Fix the issue by switching pfn_valid() to the _sched() variant of
rcu_read_lock/unlock(), which does not require calling into RCU. Given
the critical section in pfn_valid() is very small, this is a reasonable
trade-off (with preemptible RCU).
KMSAN further needs to be careful to suppress calls into the scheduler,
which would be another source of recursion. This can be done by wrapping
the call to pfn_valid() into preempt_disable/enable_no_resched(). The
downside is that this sacrifices some scheduling guarantees; however,
a kernel compiled with KMSAN has already given up any performance
guarantees due to being heavily instrumented.
Note, KMSAN code already disables tracing via Makefile, and since mmzone.h
is included, it is not necessary to use the notrace variant, which is
generally preferred in all other cases.
Link: https://lkml.kernel.org/r/20240115184430.2710652-1-glider@google.com [1]
Link: https://lkml.kernel.org/r/20240118110022.2538350-1-elver@google.com
Fixes: 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage")
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Alexander Potapenko <glider@google.com>
Reported-by: syzbot+93a9e8a3dea8d6085e12@syzkaller.appspotmail.com
Reviewed-by: Alexander Potapenko <glider@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
(struct dirty_throttle_control *)->thresh is an unsigned long, but is
passed as the u32 divisor argument to div_u64(). On architectures where
unsigned long is 64 bits, the argument will be implicitly truncated.
Use div64_u64() instead of div_u64() so that the value used in the "is
this a safe division" check is the same as the divisor.
Also, remove redundant cast of the numerator to u64, as that should happen
implicitly.
This would be difficult to exploit in memcg domain, given the ratio-based
arithmetic domain_dirty_limits() uses, but is much easier in the global
writeback domain with a BDI_CAP_STRICTLIMIT-backing device, using e.g.
vm.dirty_bytes=(1<<32)*PAGE_SIZE so that dtc->thresh == (1<<32)
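A standalone demonstration of the truncation (not the writeback code;
div_u64_like()/div64_u64_like() are simplified analogues): passing a 64-bit
value where a u32 divisor is expected silently drops the upper bits, so a
divisor of 1<<32 is seen as 0.

  #include <stdint.h>
  #include <stdio.h>

  /* analogue of div_u64(): the divisor is only 32 bits wide */
  static uint64_t div_u64_like(uint64_t dividend, uint32_t divisor)
  {
          return divisor ? dividend / divisor : 0;
  }

  /* analogue of div64_u64(): full 64-bit divisor */
  static uint64_t div64_u64_like(uint64_t dividend, uint64_t divisor)
  {
          return divisor ? dividend / divisor : 0;
  }

  int main(void)
  {
          uint64_t thresh = 1ULL << 32;   /* e.g. dtc->thresh == 1<<32 */
          uint64_t num = 100ULL << 32;

          printf("truncated divisor: %u\n", (uint32_t)thresh);      /* prints 0 */
          printf("div_u64-style:     %llu\n",
                 (unsigned long long)div_u64_like(num, (uint32_t)thresh));
          printf("div64_u64-style:   %llu\n",
                 (unsigned long long)div64_u64_like(num, thresh));  /* prints 100 */
          return 0;
  }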
Link: https://lkml.kernel.org/r/20240118181954.1415197-1-zokeefe@google.com
Fixes: f6789593d5ce ("mm/page-writeback.c: fix divide by zero in bdi_dirty_limits()")
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Cc: Maxim Patlasov <MPatlasov@parallels.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Running charge_reserved_hugetlb.sh generates errors if sh is set to
dash:
./charge_reserved_hugetlb.sh: 9: [[: not found
./charge_reserved_hugetlb.sh: 19: [[: not found
./charge_reserved_hugetlb.sh: 27: [[: not found
./charge_reserved_hugetlb.sh: 37: [[: not found
./charge_reserved_hugetlb.sh: 45: Syntax error: "(" unexpected
Switch to using /bin/bash instead of /bin/sh. Make the switch for
write_hugetlb_memory.sh as well which is called from
charge_reserved_hugetlb.sh.
Link: https://lkml.kernel.org/r/20240116090455.3407378-1-usama.anjum@collabora.com
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The maintainer uses both.
Link: https://lkml.kernel.org/r/20240117122257.2707637-1-pvorel@suse.cz
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Reviewed-by: Alejandro Colomar <alx@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
While investigating hosts with high cgroup memory pressures, Tejun
found culprit zombie tasks that had were holding on to a lot of
memory, had SIGKILL pending, but were stuck in memory.high reclaim.
In the past, we used to always force-charge allocations from tasks
that were exiting in order to accelerate them dying and freeing up
their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
prohibit unconditional exceeding the limit of dying tasks"); it noted
that this can cause (userspace inducable) containment failures, so it
added a mandatory reclaim and OOM kill cycle before forcing charges.
At the time, memory.high enforcement was handled in the userspace
return path, which isn't reached by dying tasks, and so memory.high
was still never enforced by dying tasks.
When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
overcharges") added synchronous reclaim for memory.high, it added
unconditional memory.high enforcement for dying tasks as well. The
call stack shows that this is the path where the zombie is stuck.
We need to accelerate dying tasks getting past memory.high, but we
cannot do it quite the same way as we do for memory.max: memory.max is
enforced strictly, and tasks aren't allowed to move past it without
FIRST reclaiming and OOM killing if necessary. This ensures very small
levels of excess. With memory.high, though, enforcement happens lazily
after the charge, and OOM killing is never triggered. A lot of
concurrent threads could have pushed, or could actively be pushing,
the cgroup into excess. The dying task will enter reclaim on every
allocation attempt, with little hope of restoring balance.
To fix this, skip synchronous memory.high enforcement on dying tasks
altogether again. Update memory.high path documentation while at it.
[hannes@cmpxchg.org: also handle tasks are being killed during the reclaim]
Link: https://lkml.kernel.org/r/20240111192807.GA424308@cmpxchg.org
Link: https://lkml.kernel.org/r/20240111132902.389862-1-hannes@cmpxchg.org
Fixes: c9afe31ec443 ("memcg: synchronously enforce memory.high for large overcharges")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries") incurred a regression for the stress-ng pthread benchmark [1].
This is because THPs get allocated to pthread's stack area much more often
than before. Pthread's stack area is allocated by mmap without the
VM_GROWSDOWN or VM_GROWSUP flag, so the kernel can't tell whether it is a
stack area or not.
The MAP_STACK flag is used to mark the stack area, but it is a no-op on
Linux. Map MAP_STACK to VM_NOHUGEPAGE to prevent allocating THP for such
stack areas.
With this change the stack area looks like:
fffd18e10000-fffd19610000 rw-p 00000000 00:00 0
Size: 8192 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 12 kB
Pss: 12 kB
Pss_Dirty: 12 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 12 kB
Referenced: 12 kB
Anonymous: 12 kB
KSM: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
FilePmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 0
VmFlags: rd wr mr mw me ac nh
The "nh" flag is set.
[1] https://lore.kernel.org/linux-mm/202312192310.56367035-oliver.sang@intel.com/
Link: https://lkml.kernel.org/r/20231221065943.2803551-2-shy828301@gmail.com
Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
uprobes passes an unaligned page mapping address to
folio_add_new_anon_rmap(), which ends up triggering a VM_BUG_ON() we
recently extended in commit 372cbd4d5a066 ("mm: non-pmd-mappable, large
folios for folio_add_new_anon_rmap()").
Arguably, this is uprobes code doing something wrong; however, for the
time being it would have likely worked in rmap code because
__folio_set_anon() would set folio->index to the same value.
Looking at __replace_page(), we'd also pass slightly wrong values to
mmu_notifier_range_init(), page_vma_mapped_walk(), flush_cache_page(),
ptep_clear_flush() and set_pte_at_notify(). I suspect most of them are
fine, but let's just mark the introducing commit as the one needed fixing.
I don't think CC stable is warranted.
We'll add more sanity checks in rmap code separately, to make sure that we
always get properly aligned addresses.
Link: https://lkml.kernel.org/r/20240115100731.91007-1-david@redhat.com
Fixes: c517ee744b96 ("uprobes: __replace_page() should not use page_address_in_vma()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Olsa <jolsa@kernel.org>
Closes: https://lkml.kernel.org/r/ZaMR2EWN-HvlCfUl@krava
Tested-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use 2 separate variables of types int and unsigned long long instead of
confusing them. This uses the correct print format for each of them
and removes the build warning:
warning: format `%d' expects argument of type `int', but argument 2 has type `long long unsigned int'
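The warning boils down to matching each format specifier to its variable's
type; a tiny standalone example (hypothetical variable names):

  #include <stdio.h>

  int main(void)
  {
          int ret = -1;                                   /* e.g. a syscall result */
          unsigned long long addr = 0x7f0000000000ULL;    /* e.g. a mapping address */

          /* %d for the int, %llx/%llu for the unsigned long long */
          printf("ret=%d addr=0x%llx\n", ret, addr);
          return 0;
  }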
Link: https://lkml.kernel.org/r/20240112071851.612930-1-usama.anjum@collabora.com
Fixes: a4cb3b243343 ("selftests: mm: add a test for remapping to area immediately after existing mapping")
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
has_extra_refcount() makes the assumption that the page cache adds a ref
count of 1 and subtracts this in the extra_pins case. Commit a08c7193e4f1
(mm/filemap: remove hugetlb special casing in filemap.c) modifies
__filemap_add_folio() by calling folio_ref_add(folio, nr); for all cases
(including hugetlb) where nr is the number of pages in the folio. We
should adjust the number of references coming from the page cache by
subtracting the number of pages rather than 1.
In hugetlbfs_read_iter(), folio_test_has_hwpoisoned() is testing the wrong
flag as, in the hugetlb case, memory-failure code calls
folio_test_set_hwpoison() to indicate poison. folio_test_hwpoison() is
the correct function to test for that flag.
After these fixes, the hugetlb hwpoison read selftest passes all cases.
Link: https://lkml.kernel.org/r/20240112180840.367006-1-sidhartha.kumar@oracle.com
Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c")
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Closes: https://lore.kernel.org/linux-mm/20230713001833.3778937-1-jiaqiyan@google.com/T/#m8e1469119e5b831bbd05d495f96b842e4a1c5519
Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: James Houghton <jthoughton@google.com>
Cc: Jiaqi Yan <jiaqiyan@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org> [6.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
ra_alloc_folio() marks a page that should trigger next round of async
readahead. However it rounds up computed index to the order of page being
allocated. This can however lead to multiple consecutive pages being
marked with readahead flag. Consider situation with index == 1, mark ==
1, order == 0. We insert order 0 page at index 1 and mark it. Then we
bump order to 1, index to 2, mark (still == 1) is rounded up to 2 so page
at index 2 is marked as well. Then we bump order to 2, index is
incremented to 4, mark gets rounded to 4 so page at index 4 is marked as
well. The fact that multiple pages get marked within a single readahead
window confuses the readahead logic and results in readahead window being
trimmed back to 1. This situation is triggered in particular when maximum
readahead window size is not a power of two (in the observed case it was
768 KB) and as a result sequential read throughput suffers.
Fix the problem by rounding 'mark' down instead of up. Because the index
is naturally aligned to 'order', we are guaranteed 'rounded mark' == index
iff 'mark' is within the page we are allocating at 'index' and thus
exactly one page is marked with readahead flag as required by the
readahead code and sequential read performance is restored.
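A standalone arithmetic sketch of the walk-through above (not the readahead
code itself): with mark == 1, rounding up marks every folio in the sequence,
while rounding down leaves only the folio that actually contains the mark
flagged.

  #include <stdio.h>

  static unsigned long round_up_to(unsigned long x, unsigned long a)
  {
          return (x + a - 1) & ~(a - 1);
  }

  static unsigned long round_down_to(unsigned long x, unsigned long a)
  {
          return x & ~(a - 1);
  }

  int main(void)
  {
          unsigned long mark = 1;                 /* index that should trigger the next readahead */
          unsigned long index[] = { 1, 2, 4 };    /* folio indices from the walk-through */
          unsigned int  order[] = { 0, 1, 2 };

          for (int i = 0; i < 3; i++) {
                  unsigned long nr = 1UL << order[i];
                  unsigned long up = round_up_to(mark, nr);
                  unsigned long down = round_down_to(mark, nr);
                  int marked_up = up >= index[i] && up < index[i] + nr;
                  int marked_down = down >= index[i] && down < index[i] + nr;

                  printf("folio [%lu,%lu): round_up -> %s, round_down -> %s\n",
                         index[i], index[i] + nr,
                         marked_up ? "marked" : "not marked",
                         marked_down ? "marked" : "not marked");
          }
          return 0;
  }

Rounding up flags all three folios (indices 1, 2 and 4); rounding down flags
only the first one, as the readahead code expects.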
This effectively reverts part of commit b9ff43dd2743 ("mm/readahead: Fix
readahead with large folios"). The commit changed the rounding with the
rationale:
"... we were setting the readahead flag on the folio which contains the
last byte read from the block. This is wrong because we will trigger
readahead at the end of the read without waiting to see if a subsequent
read is going to use the pages we just read."
Although this is true, the fact is this was always the case with read
sizes not aligned to folio boundaries and large folios in the page cache
just make the situation more obvious (and frequent). Also for sequential
read workloads it is better to trigger the readahead earlier rather than
later. It is true that the difference in the rounding and thus earlier
triggering of the readahead can result in reading more for semi-random
workloads. However workloads really suffering from this seem to be rare.
In particular I have verified that the workload described in commit
b9ff43dd2743 ("mm/readahead: Fix readahead with large folios") of reading
random 100k blocks from a file like:
[reader]
bs=100k
rw=randread
numjobs=1
size=64g
runtime=60s
is not impacted by the rounding change and achieves ~70MB/s in both cases.
[jack@suse.cz: fix one more place where mark rounding was done as well]
Link: https://lkml.kernel.org/r/20240123153254.5206-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20240104085839.21029-1-jack@suse.cz
Fixes: b9ff43dd2743 ("mm/readahead: Fix readahead with large folios")
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Guo Xuenan <guoxuenan@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|