linux-arm.git - Russell King's ARM Linux kernel tree

Age	Commit message (Collapse)	Author
2024-11-06	perf: Switch back to struct platform_driver::remove()	Uwe Kleine-König
	After commit 0edb555a65d1 ("platform: Make platform_driver::remove() return void") .remove() is (again) the right callback to implement for platform drivers. Convert all platform drivers below drivers/perf to use .remove(), with the eventual goal to drop struct platform_driver::remove_new(). As .remove() and .remove_new() have the same prototypes, conversion is done by just changing the structure member name in the driver initializer. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Link: https://lore.kernel.org/r/20241027180313.410964-2-u.kleine-koenig@baylibre.com Signed-off-by: Will Deacon <will@kernel.org>
2024-11-06	m68k: Make sure NR_IRQS is never zero	Geert Uytterhoeven
	When compiling an MMU kernel without any platform support, e.g. "MMU=y allnoconfig": echo CONFIG_MMU=y > allno.config make KCONFIG_ALLCONFIG=1 allnoconfig NR_IRQS is zero, causing a build failure: kernel/irq/irqdesc.c:593:3: error: array index in initializer exceeds array bounds 593 \| [0 ... NR_IRQS-1] = { \| ^ kernel/irq/irqdesc.c:593:3: note: (near initialization for ‘irq_desc’) kernel/irq/irqdesc.c:593:22: warning: excess elements in array initializer 593 \| [0 ... NR_IRQS-1] = { \| ^ kernel/irq/irqdesc.c:593:22: note: (near initialization for ‘irq_desc’) Fix this by setting the default value of NR_IRQS to 8, which is the mininum number of interrupts on any m68k CPU (the DIP48 variant of the MC68008 has tied IPL0 and IPL2 together, but that does not impact the range). Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/71db70dc27b46e5326518cd23cb94fa6bee172fe.1730805216.git.geert@linux-m68k.org
2024-11-06	m68k: Select M68020 as fallback for classic	Arnd Bergmann
	Building without a CPU selected does not work. Change the Kconfig logic slightly to make sure we always pick M68020 if nothing else is enabled. There are still some intentional warnings for builds with all platforms disabled. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/20241030195638.22542-2-arnd@kernel.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2024-11-06	m68k: Move Sun 3 into a top-level platform option	Arnd Bergmann
	It is possible to select an m68k MMU build but not actually enable any of the three MMU options, which then results in a build failure: arch/m68k/include/asm/page.h:10:25: error: 'CONFIG_PAGE_SHIFT' undeclared here (not in a function); did you mean 'CONFIG_LOG_BUF_SHIFT'? Change the Kconfig selection to ensure that exactly one of the three options is always enabled whenever an MMU-enabled kernel is built, but moving CONFIG_SUN3 into a top-level option next to M68KCLASSIC and COLDFIRE. All defconfig files should keep working without changes, but alldefconfig now builds support for the classic MMU. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202408032138.P7sBvIns-lkp@intel.com/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Greg Ungerer <gerg@linux-m68k.org> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/20241030195638.22542-1-arnd@kernel.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2024-11-06	m68k: kernel: Use str_read_write() helper function	Thorsten Blum
	Remove hard-coded strings by using the str_read_write() helper function and remove some unnecessary negations. Compile-tested only. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/20241020205758.332095-1-thorsten.blum@linux.dev Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2024-11-06	m68k: Initialize jump labels early during setup_arch()	Jean-Michel Hautbois
	When using static keys early, e.g. by specifying "thread_backlog_napi" on the kernel command line: WARNING: CPU: 0 PID: 0 at include/linux/jump_label.h:322 setup_backlog_napi_threads+0x40/0xa0 static_key_enable(): static key '0x5ceec0' used before call to jump_label_init() The function jump_label_init() should be called from setup_arch() very early for proper functioning of jump label support. Signed-off-by: Jean-Michel Hautbois <jeanmichel.hautbois@yoseli.org> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/20241016-fix-jump-label-v1-1-eb74c5f68405@yoseli.org [geert: Add reproducer] Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2024-11-06	m68k: mvme147: Fix SCSI controller IRQ numbers	Daniel Palmer
	Sometime long ago the m68k IRQ code was refactored and the interrupt numbers for SCSI controller on this board ended up wrong, and it hasn't worked since. The PCC adds 0x40 to the vector for its interrupts so they end up in the user interrupt range. Hence, the kernel number should be the kernel offset for user interrupt range + the PCC interrupt number. Fixes: 200a3d352cd5 ("[PATCH] m68k: convert VME irq code") Signed-off-by: Daniel Palmer <daniel@0x0f.com> Reviewed-by: Finn Thain <fthain@linux-m68k.org> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/0e7636a21a0274eea35bfd5d874459d5078e97cc.1727926187.git.fthain@linux-m68k.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2024-11-06	m68k: mvme147: Make mvme147_sched_init() __init	Daniel Palmer
	mvme147_sched_init() is only used once at init time so it can be made __init and save a few bytes of memory. Signed-off-by: Daniel Palmer <daniel@0x0f.com> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/20240929025506.1212237-1-daniel@0x0f.com Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2024-11-06	USB: serial: qcserial: add support for Sierra Wireless EM86xx	Jack Wu
	Add support for Sierra Wireless EM86xx with USB-id 0x1199:0x90e5 and 0x1199:0x90e4. 0x1199:0x90e5 T: Bus=03 Lev=01 Prnt=01 Port=05 Cnt=01 Dev#= 14 Spd=480 MxCh= 0 D: Ver= 2.00 Cls=ef(misc ) Sub=02 Prot=01 MxPS=64 #Cfgs= 1 P: Vendor=1199 ProdID=90e5 Rev= 5.15 S: Manufacturer=Sierra Wireless, Incorporated S: Product=Semtech EM8695 Mobile Broadband Adapter S: SerialNumber=004403161882339 C:* #Ifs= 6 Cfg#= 1 Atr=a0 MxPwr=500mA A: FirstIf#=12 IfCount= 2 Cls=02(comm.) Sub=0e Prot=00 I:* If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=qcserial E: Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms I:* If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=42 Prot=01 Driver=usbfs E: Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=82(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms I:* If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=qcserial E: Ad=84(I) Atr=03(Int.) MxPS= 10 Ivl=32ms E: Ad=83(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms I:* If#= 4 Alt= 0 #EPs= 1 Cls=ff(vend.) Sub=ff Prot=ff Driver=(none) E: Ad=85(I) Atr=03(Int.) MxPS= 64 Ivl=32ms I:* If#=12 Alt= 0 #EPs= 1 Cls=02(comm.) Sub=0e Prot=00 Driver=cdc_mbim E: Ad=87(I) Atr=03(Int.) MxPS= 64 Ivl=32ms I: If#=13 Alt= 0 #EPs= 0 Cls=0a(data ) Sub=00 Prot=02 Driver=cdc_mbim I:* If#=13 Alt= 1 #EPs= 2 Cls=0a(data ) Sub=00 Prot=02 Driver=cdc_mbim E: Ad=86(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms 0x1199:0x90e4 T: Bus=03 Lev=01 Prnt=01 Port=05 Cnt=01 Dev#= 16 Spd=480 MxCh= 0 D: Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1 P: Vendor=1199 ProdID=90e4 Rev= 0.00 S: Manufacturer=Sierra Wireless, Incorporated S: SerialNumber=004403161882339 C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr= 2mA I:* If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=10 Driver=qcserial E: Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms Signed-off-by: Jack Wu <wojackbb@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Johan Hovold <johan@kernel.org>
2024-11-06	arm64/sve: Discard stale CPU state when handling SVE traps	Mark Brown
	The logic for handling SVE traps manipulates saved FPSIMD/SVE state incorrectly, and a race with preemption can result in a task having TIF_SVE set and TIF_FOREIGN_FPSTATE clear even though the live CPU state is stale (e.g. with SVE traps enabled). This has been observed to result in warnings from do_sve_acc() where SVE traps are not expected while TIF_SVE is set: \| if (test_and_set_thread_flag(TIF_SVE)) \| WARN_ON(1); /* SVE access shouldn't have trapped / Warnings of this form have been reported intermittently, e.g. https://lore.kernel.org/linux-arm-kernel/CA+G9fYtEGe_DhY2Ms7+L7NKsLYUomGsgqpdBj+QwDLeSg=JhGg@mail.gmail.com/ https://lore.kernel.org/linux-arm-kernel/000000000000511e9a060ce5a45c@google.com/ The race can occur when the SVE trap handler is preempted before and after manipulating the saved FPSIMD/SVE state, starting and ending on the same CPU, e.g. \| void do_sve_acc(unsigned long esr, struct pt_regs regs) \| { \| // Trap on CPU 0 with TIF_SVE clear, SVE traps enabled \| // task->fpsimd_cpu is 0. \| // per_cpu_ptr(&fpsimd_last_state, 0) is task. \| \| ... \| \| // Preempted; migrated from CPU 0 to CPU 1. \| // TIF_FOREIGN_FPSTATE is set. \| \| get_cpu_fpsimd_context(); \| \| if (test_and_set_thread_flag(TIF_SVE)) \| WARN_ON(1); /* SVE access shouldn't have trapped */ \| \| sve_init_regs() { \| if (!test_thread_flag(TIF_FOREIGN_FPSTATE)) { \| ... \| } else { \| fpsimd_to_sve(current); \| current->thread.fp_type = FP_STATE_SVE; \| } \| } \| \| put_cpu_fpsimd_context(); \| \| // Preempted; migrated from CPU 1 to CPU 0. \| // task->fpsimd_cpu is still 0 \| // If per_cpu_ptr(&fpsimd_last_state, 0) is still task then: \| // - Stale HW state is reused (with SVE traps enabled) \| // - TIF_FOREIGN_FPSTATE is cleared \| // - A return to userspace skips HW state restore \| } Fix the case where the state is not live and TIF_FOREIGN_FPSTATE is set by calling fpsimd_flush_task_state() to detach from the saved CPU state. This ensures that a subsequent context switch will not reuse the stale CPU state, and will instead set TIF_FOREIGN_FPSTATE, forcing the new state to be reloaded from memory prior to a return to userspace. Fixes: cccb78ce89c4 ("arm64/sve: Rework SVE access trap to convert state in registers") Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Mark Brown <broonie@kernel.org> Cc: stable@vger.kernel.org Reviewed-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20241030-arm64-fpsimd-foreign-flush-v1-1-bd7bd66905a2@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2024-11-06	platform/x86: thinkpad_acpi: Fix for ThinkPad's with ECFW showing incorrect ↵	Vishnu Sankar
	fan speed Fix for Thinkpad's with ECFW showing incorrect fan speed. Some models use decimal instead of hexadecimal for the speed stored in the EC registers. For example the rpm register will have 0x4200 instead of 0x1068, here the actual RPM is "4200" in decimal. Add a quirk to handle this. Signed-off-by: Vishnu Sankar <vishnuocv@gmail.com> Suggested-by: Mark Pearson <mpearson-lenovo@squebb.ca> Link: https://lore.kernel.org/r/20241105235505.8493-1-vishnuocv@gmail.com Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2024-11-06	Merge patch series "tmpfs: Casefold fixes"	Christian Brauner
	André Almeida <andrealmeid@igalia.com> says: After casefold support for tmpfs was merged into vfs tree, two warnings were reported and I also found a small fix in the code. Thanks Nathan Chancellor and Stephen Rothwell! * patches from https://lore.kernel.org/r/20241101164251.327884-1-andrealmeid@igalia.com: tmpfs: Initialize sysfs during tmpfs init tmpfs: Fix type for sysfs' casefold attribute libfs: Fix kernel-doc warning in generic_ci_validate_strict_name Link: https://lore.kernel.org/r/20241101164251.327884-1-andrealmeid@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-06	tmpfs: Initialize sysfs during tmpfs init	André Almeida
	Instead of using fs_initcall(), initialize sysfs with the rest of the filesystem. This is the right way to do it because otherwise any error during tmpfs_sysfs_init() would get silently ignored. It's also useful if tmpfs' sysfs ever need to display runtime information. Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241101164251.327884-4-andrealmeid@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-06	tmpfs: Fix type for sysfs' casefold attribute	André Almeida
	DEVICE_STRING_ATTR_RO should be only used by device drivers since it relies on `struct device` to use device_show_string() function. Using this with non device code led to a kCFI violation: > cat /sys/fs/tmpfs/features/casefold [ 70.558496] CFI failure at kobj_attr_show+0x2c/0x4c (target: device_show_string+0x0/0x38; expected type: 0xc527b809) Like the other filesystems, fix this by manually declaring the attribute using kobj_attribute() and writing a proper show() function. Also, leave macros for anyone that need to expand tmpfs sysfs' with more attributes. Fixes: 5132f08bd332 ("tmpfs: Expose filesystem features via sysfs") Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/lkml/20241031051822.GA2947788@thelio-3990X/ Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241101164251.327884-3-andrealmeid@igalia.com Tested-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-06	libfs: Fix kernel-doc warning in generic_ci_validate_strict_name	André Almeida
	Fix the indentation of the return values from generic_ci_validate_strict_name() to properly render the comment and to address a `make htmldocs` warning: Documentation/filesystems/api-summary:14: include/linux/fs.h:3504: WARNING: Bullet list ends without a blank line; unexpected unindent. Fixes: 0e152beb5aa1 ("libfs: Create the helper function generic_ci_validate_strict_name()") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/lkml/20241030162435.05425f60@canb.auug.org.au/ Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241101164251.327884-2-andrealmeid@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-06	x86/resctrl: Support Sub-NUMA cluster mode SNC6	Tony Luck
	Support Sub-NUMA cluster mode with 6 nodes per L3 cache (SNC6) on some Intel platforms. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20241031220213.17991-1-tony.luck@intel.com
2024-11-06	Merge tag 'fs-atomic_2024-11-05' of ↵	Christian Brauner
	ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into vfs.untorn.writes Snapshot of untorn@fs-atomic#ritesh.list_ext4-do-not-fallback-to-buffered-io-for-dio-atomic-write at Tue Nov 5 16:20:51 PST 2024 Link: https://lore.kernel.org/r/20241106-zerkleinern-verzweifeln-7ec8173c56ad@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-06	freevxfs: Replace one-element array with flexible array member	Thorsten Blum
	Replace the deprecated one-element array with a modern flexible array member in the struct vxfs_dirblk. Link: https://github.com/KSPP/linux/issues/79 Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://lore.kernel.org/r/20241103121707.102838-3-thorsten.blum@linux.dev Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-06	vp_vdpa: fix id_table array not null terminated error	Xiaoguang Wang
	Allocate one extra virtio_device_id as null terminator, otherwise vdpa_mgmtdev_get_classes() may iterate multiple times and visit undefined memory. Fixes: ffbda8e9df10 ("vdpa/vp_vdpa : add vdpa tool support in vp_vdpa") Cc: stable@vger.kernel.org Suggested-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Angus Chen <angus.chen@jaguarmicro.com> Signed-off-by: Xiaoguang Wang <lege.wang@jaguarmicro.com> Message-Id: <20241105133518.1494-1-lege.wang@jaguarmicro.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Acked-by: Jason Wang <jasowang@redhat.com>
2024-11-06	virtio_pci: Fix admin vq cleanup by using correct info pointer	Feng Liu
	vp_modern_avq_cleanup() and vp_del_vqs() clean up admin vq resources by virtio_pci_vq_info pointer. The info pointer of admin vq is stored in vp_dev->admin_vq.info instead of vp_dev->vqs[]. Using the info pointer from vp_dev->vqs[] for admin vq causes a kernel NULL pointer dereference bug. In vp_modern_avq_cleanup() and vp_del_vqs(), get the info pointer from vp_dev->admin_vq.info for admin vq to clean up the resources. Also make info ptr as argument of vp_del_vq() to be symmetric with vp_setup_vq(). vp_reset calls vp_modern_avq_cleanup, and causes the Call Trace: ================================================================== BUG: kernel NULL pointer dereference, address:0000000000000000 ... CPU: 49 UID: 0 PID: 4439 Comm: modprobe Not tainted 6.11.0-rc5 #1 RIP: 0010:vp_reset+0x57/0x90 [virtio_pci] Call Trace: <TASK> ... ? vp_reset+0x57/0x90 [virtio_pci] ? vp_reset+0x38/0x90 [virtio_pci] virtio_reset_device+0x1d/0x30 remove_vq_common+0x1c/0x1a0 [virtio_net] virtnet_remove+0xa1/0xc0 [virtio_net] virtio_dev_remove+0x46/0xa0 ... virtio_pci_driver_exit+0x14/0x810 [virtio_pci] ================================================================== Fixes: 4c3b54af907e ("virtio_pci_modern: use completion instead of busy loop to wait on admin cmd result") Signed-off-by: Feng Liu <feliu@nvidia.com> Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Message-Id: <20241024135406.81388-1-feliu@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-11-06	vDPA/ifcvf: Fix pci_read_config_byte() return code handling	Yuan Can
	ifcvf_init_hw() uses pci_read_config_byte() that returns PCIBIOS_* codes. The error handling, however, assumes the codes are normal errnos because it checks for < 0. Convert the error check to plain non-zero check. Fixes: 5a2414bc454e ("virtio: Intel IFC VF driver for VDPA") Signed-off-by: Yuan Can <yuancan@huawei.com> Message-Id: <20241017013812.129952-1-yuancan@huawei.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Zhu Lingshan <lingshan.zhu@kernel.org>
2024-11-06	Fix typo in vringh_test.c	Shivam Chaudhary
	Corrected minor typo in tools/virtio/vringh_test.c: - Fixed "retreives" to "retrieves" Signed-off-by: Shivam Chaudhary <cvam0000@gmail.com> Message-Id: <20241008145204.478749-1-cvam0000@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-11-06	vdpa: solidrun: Fix UB bug with devres	Philipp Stanner
	In psnet_open_pf_bar() and snet_open_vf_bar() a string later passed to pcim_iomap_regions() is placed on the stack. Neither pcim_iomap_regions() nor the functions it calls copy that string. Should the string later ever be used, this, consequently, causes undefined behavior since the stack frame will by then have disappeared. Fix the bug by allocating the strings on the heap through devm_kasprintf(). Cc: stable@vger.kernel.org # v6.3 Fixes: 51a8f9d7f587 ("virtio: vdpa: new SolidNET DPU driver.") Reported-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Closes: https://lore.kernel.org/all/74e9109a-ac59-49e2-9b1d-d825c9c9f891@wanadoo.fr/ Suggested-by: Andy Shevchenko <andy@kernel.org> Signed-off-by: Philipp Stanner <pstanner@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Message-Id: <20241028074357.9104-3-pstanner@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-11-06	vsock/virtio: Initialization of the dangling pointer occurring in vsk->trans	Hyunwoo Kim
	During loopback communication, a dangling pointer can be created in vsk->trans, potentially leading to a Use-After-Free condition. This issue is resolved by initializing vsk->trans to NULL. Cc: stable <stable@kernel.org> Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko") Signed-off-by: Hyunwoo Kim <v4bel@theori.io> Signed-off-by: Wongi Lee <qwerty@theori.io> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Message-Id: <2024102245-strive-crib-c8d3@gregkh> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-11-06	KVM: PPC: Book3S HV: Mask off LPCR_MER for a vCPU before running it to avoid ↵	Gautam Menghani
	spurious interrupts Running a L2 vCPU (see [1] for terminology) with LPCR_MER bit set and no pending interrupts results in that L2 vCPU getting an infinite flood of spurious interrupts. The 'if check' in kvmhv_run_single_vcpu() sets the LPCR_MER bit if there are pending interrupts. The spurious flood problem can be observed in 2 cases: 1. Crashing the guest while interrupt heavy workload is running a. Start a L2 guest and run an interrupt heavy workload (eg: ipistorm) b. While the workload is running, crash the guest (make sure kdump is configured) c. Any one of the vCPUs of the guest will start getting an infinite flood of spurious interrupts. 2. Running LTP stress tests in multiple guests at the same time a. Start 4 L2 guests. b. Start running LTP stress tests on all 4 guests at same time. c. In some time, any one/more of the vCPUs of any of the guests will start getting an infinite flood of spurious interrupts. The root cause of both the above issues is the same: 1. A NMI is sent to a running vCPU that has LPCR_MER bit set. 2. In the NMI path, all registers are refreshed, i.e, H_GUEST_GET_STATE is called for all the registers. 3. When H_GUEST_GET_STATE is called for LPCR, the vcpu->arch.vcore->lpcr of that vCPU at L1 level gets updated with LPCR_MER set to 1, and this new value is always used whenever that vCPU runs, regardless of whether there was a pending interrupt. 4. Since LPCR_MER is set, the vCPU in L2 always jumps to the external interrupt handler, and this cycle never ends. Fix the spurious flood by masking off the LPCR_MER bit before running a L2 vCPU to ensure that it is not set if there are no pending interrupts. [1] Terminology: 1. L0 : PAPR hypervisor running in HV mode 2. L1 : Linux guest (logical partition) running on top of L0 3. L2 : KVM guest running on top of L1 Fixes: ec0f6639fa88 ("KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to the L0") Cc: stable@vger.kernel.org # v6.8+ Signed-off-by: Gautam Menghani <gautam@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
2024-11-05	md/md-bitmap: Add missing destroy_work_on_stack()	Yuan Can
	This commit add missed destroy_work_on_stack() operations for unplug_work.work in bitmap_unplug_async(). Fixes: a022325ab970 ("md/md-bitmap: add a new helper to unplug bitmap asynchrously") Cc: stable@vger.kernel.org Signed-off-by: Yuan Can <yuancan@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20241105130105.127336-1-yuancan@huawei.com Signed-off-by: Song Liu <song@kernel.org>
2024-11-05	Merge branch '100GbE' of ↵	Jakub Kicinski
	git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2024-11-04 (ice, idpf, i40e, e1000e) For ice: Marcin adjusts ordering of calls in ice_eswitch_detach() to resolve a use after free issue. Mateusz corrects variable type for Flow Director queue to fix issues related to drop actions. For idpf: Pavan resolves issues related to reset on idpf; avoiding use of freed vport and correctly unrolling the mailbox task. For i40e: Aleksandr fixes a race condition involving addition and deletion of VF MAC filters. For e1000e: Vitaly reverts workaround for Meteor Lake causing regressions in power management flows. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: e1000e: Remove Meteor Lake SMBUS workarounds i40e: fix race condition by adding filter's intermediate sync state idpf: fix idpf_vc_core_init error path idpf: avoid vport access in idpf_get_link_ksettings ice: change q_index variable type to s16 to store -1 value ice: Fix use after free during unload with ports in bridge ==================== Link: https://patch.msgid.link/20241104223639.2801097-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-05	Merge branch 'mptcp-pm-fix-wrong-perm-and-sock-kfree'	Jakub Kicinski
	Matthieu Baerts says: ==================== mptcp: pm: fix wrong perm and sock kfree Two small fixes related to the MPTCP path-manager: - Patch 1: remove an accidental restriction to admin users to list MPTCP endpoints. A regression from v6.7. - Patch 2: correctly use sock_kfree_s() instead of kfree() in the userspace PM. A fix for another fix introduced in v6.4 and backportable up to v5.19. ==================== Link: https://patch.msgid.link/20241104-net-mptcp-misc-6-12-v1-0-c13f2ff1656f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-05	mptcp: use sock_kfree_s instead of kfree	Geliang Tang
	The local address entries on userspace_pm_local_addr_list are allocated by sock_kmalloc(). It's then required to use sock_kfree_s() instead of kfree() to free these entries in order to adjust the allocated size on the sk side. Fixes: 24430f8bf516 ("mptcp: add address into userspace pm list") Cc: stable@vger.kernel.org Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241104-net-mptcp-misc-6-12-v1-2-c13f2ff1656f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-05	mptcp: no admin perm to list endpoints	Matthieu Baerts (NGI0)
	During the switch to YNL, the command to list all endpoints has been accidentally restricted to users with admin permissions. It looks like there are no reasons to have this restriction which makes it harder for a user to quickly check if the endpoint list has been correctly populated by an automated tool. Best to go back to the previous behaviour then. mptcp_pm_gen.c has been modified using ynl-gen-c.py: $ ./tools/net/ynl/ynl-gen-c.py --mode kernel \ --spec Documentation/netlink/specs/mptcp_pm.yaml --source \ -o net/mptcp/mptcp_pm_gen.c The header file doesn't need to be regenerated. Fixes: 1d0507f46843 ("net: mptcp: convert netlink from small_ops to ops") Cc: stable@vger.kernel.org Reviewed-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241104-net-mptcp-misc-6-12-v1-1-c13f2ff1656f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-05	net: phy: ti: add PHY_RST_AFTER_CLK_EN flag	Diogo Silva
	DP83848 datasheet (section 4.7.2) indicates that the reset pin should be toggled after the clocks are running. Add the PHY_RST_AFTER_CLK_EN to make sure that this indication is respected. In my experience not having this flag enabled would lead to, on some boots, the wrong MII mode being selected if the PHY was initialized on the bootloader and was receiving data during Linux boot. Signed-off-by: Diogo Silva <diogompaissilva@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Fixes: 34e45ad9378c ("net: phy: dp83848: Add TI DP83848 Ethernet PHY") Link: https://patch.msgid.link/20241102151504.811306-1-paissilva@ld-100007.ds1.internal Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-05	mm: resolve faulty mmap_region() error path behaviour	Lorenzo Stoakes
	The mmap_region() function is somewhat terrifying, with spaghetti-like control flow and numerous means by which issues can arise and incomplete state, memory leaks and other unpleasantness can occur. A large amount of the complexity arises from trying to handle errors late in the process of mapping a VMA, which forms the basis of recently observed issues with resource leaks and observable inconsistent state. Taking advantage of previous patches in this series we move a number of checks earlier in the code, simplifying things by moving the core of the logic into a static internal function __mmap_region(). Doing this allows us to perform a number of checks up front before we do any real work, and allows us to unwind the writable unmap check unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE validation unconditionally also. We move a number of things here: 1. We preallocate memory for the iterator before we call the file-backed memory hook, allowing us to exit early and avoid having to perform complicated and error-prone close/free logic. We carefully free iterator state on both success and error paths. 2. The enclosing mmap_region() function handles the mapping_map_writable() logic early. Previously the logic had the mapping_map_writable() at the point of mapping a newly allocated file-backed VMA, and a matching mapping_unmap_writable() on success and error paths. We now do this unconditionally if this is a file-backed, shared writable mapping. If a driver changes the flags to eliminate VM_MAYWRITE, however doing so does not invalidate the seal check we just performed, and we in any case always decrement the counter in the wrapper. We perform a debug assert to ensure a driver does not attempt to do the opposite. 3. We also move arch_validate_flags() up into the mmap_region() function. This is only relevant on arm64 and sparc64, and the check is only meaningful for SPARC with ADI enabled. We explicitly add a warning for this arch if a driver invalidates this check, though the code ought eventually to be fixed to eliminate the need for this. With all of these measures in place, we no longer need to explicitly close the VMA on error paths, as we place all checks which might fail prior to a call to any driver mmap hook. This eliminates an entire class of errors, makes the code easier to reason about and more robust. Link: https://lkml.kernel.org/r/6e0becb36d2f5472053ac5d544c0edfe9b899e25.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Mark Brown <broonie@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05	mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling	Lorenzo Stoakes
	Currently MTE is permitted in two circumstances (desiring to use MTE having been specified by the VM_MTE flag) - where MAP_ANONYMOUS is specified, as checked by arch_calc_vm_flag_bits() and actualised by setting the VM_MTE_ALLOWED flag, or if the file backing the mapping is shmem, in which case we set VM_MTE_ALLOWED in shmem_mmap() when the mmap hook is activated in mmap_region(). The function that checks that, if VM_MTE is set, VM_MTE_ALLOWED is also set is the arm64 implementation of arch_validate_flags(). Unfortunately, we intend to refactor mmap_region() to perform this check earlier, meaning that in the case of a shmem backing we will not have invoked shmem_mmap() yet, causing the mapping to fail spuriously. It is inappropriate to set this architecture-specific flag in general mm code anyway, so a sensible resolution of this issue is to instead move the check somewhere else. We resolve this by setting VM_MTE_ALLOWED much earlier in do_mmap(), via the arch_calc_vm_flag_bits() call. This is an appropriate place to do this as we already check for the MAP_ANONYMOUS case here, and the shmem file case is simply a variant of the same idea - we permit RAM-backed memory. This requires a modification to the arch_calc_vm_flag_bits() signature to pass in a pointer to the struct file associated with the mapping, however this is not too egregious as this is only used by two architectures anyway - arm64 and parisc. So this patch performs this adjustment and removes the unnecessary assignment of VM_MTE_ALLOWED in shmem_mmap(). [akpm@linux-foundation.org: fix whitespace, per Catalin] Link: https://lkml.kernel.org/r/ec251b20ba1964fb64cf1607d2ad80c47f3873df.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andreas Larsson <andreas@gaisler.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05	mm: refactor map_deny_write_exec()	Lorenzo Stoakes
	Refactor the map_deny_write_exec() to not unnecessarily require a VMA parameter but rather to accept VMA flags parameters, which allows us to use this function early in mmap_region() in a subsequent commit. While we're here, we refactor the function to be more readable and add some additional documentation. Link: https://lkml.kernel.org/r/6be8bb59cd7c68006ebb006eb9d8dc27104b1f70.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05	mm: unconditionally close VMAs on error	Lorenzo Stoakes
	Incorrect invocation of VMA callbacks when the VMA is no longer in a consistent state is bug prone and risky to perform. With regards to the important vm_ops->close() callback We have gone to great lengths to try to track whether or not we ought to close VMAs. Rather than doing so and risking making a mistake somewhere, instead unconditionally close and reset vma->vm_ops to an empty dummy operations set with a NULL .close operator. We introduce a new function to do so - vma_close() - and simplify existing vms logic which tracked whether we needed to close or not. This simplifies the logic, avoids incorrect double-calling of the .close() callback and allows us to update error paths to simply call vma_close() unconditionally - making VMA closure idempotent. Link: https://lkml.kernel.org/r/28e89dda96f68c505cb6f8e9fc9b57c3e9f74b42.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05	mm: avoid unsafe VMA hook invocation when error arises on mmap hook	Lorenzo Stoakes
	Patch series "fix error handling in mmap_region() and refactor (hotfixes)", v4. mmap_region() is somewhat terrifying, with spaghetti-like control flow and numerous means by which issues can arise and incomplete state, memory leaks and other unpleasantness can occur. A large amount of the complexity arises from trying to handle errors late in the process of mapping a VMA, which forms the basis of recently observed issues with resource leaks and observable inconsistent state. This series goes to great lengths to simplify how mmap_region() works and to avoid unwinding errors late on in the process of setting up the VMA for the new mapping, and equally avoids such operations occurring while the VMA is in an inconsistent state. The patches in this series comprise the minimal changes required to resolve existing issues in mmap_region() error handling, in order that they can be hotfixed and backported. There is additionally a follow up series which goes further, separated out from the v1 series and sent and updated separately. This patch (of 5): After an attempted mmap() fails, we are no longer in a situation where we can safely interact with VMA hooks. This is currently not enforced, meaning that we need complicated handling to ensure we do not incorrectly call these hooks. We can avoid the whole issue by treating the VMA as suspect the moment that the file->f_ops->mmap() function reports an error by replacing whatever VMA operations were installed with a dummy empty set of VMA operations. We do so through a new helper function internal to mm - mmap_file() - which is both more logically named than the existing call_mmap() function and correctly isolates handling of the vm_op reassignment to mm. All the existing invocations of call_mmap() outside of mm are ultimately nested within the call_mmap() from mm, which we now replace. It is therefore safe to leave call_mmap() in place as a convenience function (and to avoid churn). The invokers are: ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap() coda_file_operations -> mmap -> coda_file_mmap() shm_file_operations -> shm_mmap() shm_file_operations_huge -> shm_mmap() dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops -> i915_gem_dmabuf_mmap() None of these callers interact with vm_ops or mappings in a problematic way on error, quickly exiting out. Link: https://lkml.kernel.org/r/cover.1730224667.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/d41fd763496fd0048a962f3fd9407dc72dd4fd86.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05	mm/thp: fix deferred split unqueue naming and locking	Hugh Dickins
	Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting. Before fixing locking: rename misleading folio_undo_large_rmappable(), which does not undo large_rmappable, to folio_unqueue_deferred_split(), which is what it does. But that and its out-of-line __callee are mm internals of very limited usability: add comment and WARN_ON_ONCEs to check usage; and return a bool to say if a deferred split was unqueued, which can then be used in WARN_ON_ONCEs around safety checks (sparing callers the arcane conditionals in __folio_unqueue_deferred_split()). Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all of whose callers now call it beforehand (and if any forget then bad_page() will tell) - except for its caller put_pages_list(), which itself no longer has any callers (and will be deleted separately). Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0 without checking and unqueueing a THP folio from deferred split list; which is unfortunate, since the split_queue_lock depends on the memcg (when memcg is enabled); so swapout has been unqueueing such THPs later, when freeing the folio, using the pgdat's lock instead: potentially corrupting the memcg's list. __remove_mapping() has frozen refcount to 0 here, so no problem with calling folio_unqueue_deferred_split() before resetting memcg_data. That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware"): which included a check on swapcache before adding to deferred queue, but no check on deferred queue before adding THP to swapcache. That worked fine with the usual sequence of events in reclaim (though there were a couple of rare ways in which a THP on deferred queue could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split underused THPs") avoids splitting underused THPs in reclaim, which makes swapcache THPs on deferred queue commonplace. Keep the check on swapcache before adding to deferred queue? Yes: it is no longer essential, but preserves the existing behaviour, and is likely to be a worthwhile optimization (vmstat showed much more traffic on the queue under swapping load if the check was removed); update its comment. Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing folio->memcg_data without checking and unqueueing a THP folio from the deferred list, sometimes corrupting "from" memcg's list, like swapout. Refcount is non-zero here, so folio_unqueue_deferred_split() can only be used in a WARN_ON_ONCE to validate the fix, which must be done earlier: mem_cgroup_move_charge_pte_range() first try to split the THP (splitting of course unqueues), or skip it if that fails. Not ideal, but moving charge has been requested, and khugepaged should repair the THP later: nobody wants new custom unqueueing code just for this deprecated case. The 87eaceb3faa5 commit did have the code to move from one deferred list to another (but was not conscious of its unsafety while refcount non-0); but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care deferred split queue in memcg charge move path"), which argued that the existence of a PMD mapping guarantees that the THP cannot be on a deferred list. As above, false in rare cases, and now commonly false. Backport to 6.11 should be straightforward. Earlier backports must take care that other _deferred_list fixes and dependencies are included. There is not a strong case for backports, but they can fix cornercases. Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05	mm/thp: fix deferred split queue not partially_mapped	Hugh Dickins
	Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting. The new unlocked list_del_init() in deferred_split_scan() is buggy. I gave bad advice, it looks plausible since that's a local on-stack list, but the fact is that it can race with a third party freeing or migrating the preceding folio (properly unqueueing it with refcount 0 while holding split_queue_lock), thereby corrupting the list linkage. The obvious answer would be to take split_queue_lock there: but it has a long history of contention, so I'm reluctant to add to that. Instead, make sure that there is always one safe (raised refcount) folio before, by delaying its folio_put(). (And of course I was wrong to suggest updating split_queue_len without the lock: leave that until the splice.) And remove two over-eager partially_mapped checks, restoring those tests to how they were before: if uncharge_folio() or free_tail_page_prepare() finds _deferred_list non-empty, it's in trouble whether or not that folio is partially_mapped (and the flag was already cleared in the latter case). Link: https://lkml.kernel.org/r/81e34a8b-113a-0701-740e-2135c97eb1d7@google.com Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05	ext4: Do not fallback to buffered-io for DIO atomic write	Ritesh Harjani (IBM)
	atomic writes is currently only supported for single fsblock and only for direct-io. We should not return -ENOTBLK for atomic writes since we want the atomic write request to either complete fully or fail otherwise. Hence, we should never fallback to buffered-io in case of DIO atomic write requests. Let's also catch if this ever happens by adding some WARN_ON_ONCE before buffered-io handling for direct-io atomic writes. More details of the discussion [1]. While at it let's add an inline helper ext4_want_directio_fallback() which simplifies the logic checks and inherently fixes condition on when to return -ENOTBLK which otherwise was always returning true for any write or directio in ext4_iomap_end(). It was ok since ext4 only supports direct-io via iomap. [1]: https://lore.kernel.org/linux-xfs/cover.1729825985.git.ritesh.list@gmail.com/T/#m9dbecc11bed713ed0d7a486432c56b105b555f04 Suggested-by: Darrick J. Wong <djwong@kernel.org> # inline helper Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz>
2024-11-05	ext4: Support setting FMODE_CAN_ATOMIC_WRITE	Ritesh Harjani (IBM)
	FS needs to add the fmode capability in order to support atomic writes during file open (refer kiocb_set_rw_flags()). Set this capability on a regular file if ext4 can do atomic write. Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz>
2024-11-05	ext4: Check for atomic writes support in write iter	Ritesh Harjani (IBM)
	Let's validate the given constraints for atomic write request. Otherwise it will fail with -EINVAL. Currently atomic write is only supported on DIO, so for buffered-io it will return -EOPNOTSUPP. Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz>
2024-11-05	ext4: Add statx support for atomic writes	Ritesh Harjani (IBM)
	This patch adds base support for atomic writes via statx getattr. On bs < ps systems, we can create FS with say bs of 16k. That means both atomic write min and max unit can be set to 16k for supporting atomic writes. Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz>
2024-11-05	md/raid5: don't set Faulty rdev for blocked_rdev	Yu Kuai
	Faulty rdev should never be accessed anymore, hence there is no point to wait for bad block to be acknowledged in this case while handling write request. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-8-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-11-05	md/raid10: don't wait for Faulty rdev in wait_blocked_rdev()	Yu Kuai
	Faulty rdev should never be accessed anymore, hence there is no point to wait for bad block to be acknowledged in this case while handling write request. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-7-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-11-05	md/raid1: don't wait for Faulty rdev in wait_blocked_rdev()	Yu Kuai
	Faulty rdev should never be accessed anymore, hence there is no point to wait for bad block to be acknowledged in this case while handling write request. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-6-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-11-05	md/raid1: factor out helper to handle blocked rdev from raid1_write_request()	Yu Kuai
	Currently raid1 is preparing IO for underlying disks while checking if any disk is blocked, if so allocated resources must be released, then waiting for rdev to be unblocked and try to prepare IO again. Make code cleaner by checking blocked rdev first, it doesn't matter if rdev is blocked while issuing IO, the IO will wait for rdev to be unblocked or not. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-5-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-11-05	md: don't record new badblocks for faulty rdev	Yu Kuai
	Faulty will be checked before issuing IO to the rdev, however, rdev can be faulty at any time, hence it's possible that rdev_set_badblocks() will be called for faulty rdev. In this case, mddev->sb_flags will be set and some other path can be blocked by updating super block. Since faulty rdev will not be accesed anymore, there is no need to record new babblocks for faulty rdev and forcing updating super block. Noted this is not a bugfix, just prevent updating superblock in some corner cases, and will help to slice a bug related to external metadata[1], testing also shows that devices are removed faster in the case IO error. [1] https://lore.kernel.org/all/f34452df-810b-48b2-a9b4-7f925699a9e7@linux.intel.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-4-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-11-05	md: don't wait faulty rdev in md_wait_for_blocked_rdev()	Yu Kuai
	md_wait_for_blocked_rdev() is called for write IO while rdev is blocked, howerver, rdev can be faulty after choosing this rdev to write, and faulty rdev should never be accessed anymore, hence there is no point to wait for faulty rdev to be unblocked. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-11-05	md: add a new helper rdev_blocked()	Yu Kuai
	The helper will be used in later patches for raid1/raid10/raid5, the difference is that Faulty rdev with unacknowledged bad block will not be considered blocked. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-11-05	md/raid5-ppl: Use atomic64_inc_return() in ppl_new_iounit()	Uros Bizjak
	Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref) to use optimized implementation and ease register pressure around the primitive for targets that implement optimized variant. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Song Liu <song@kernel.org> Cc: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Link: https://lore.kernel.org/r/20241007084831.48067-1-ubizjak@gmail.com Signed-off-by: Song Liu <song@kernel.org>