linux.git - Linus' kernel tree

Age	Commit message	Author
2025-05-09	selftests/timens: timerfd: Use correct clockid type in tclock_gettime()	Thomas Weißschuh
	tclock_gettime() is a wrapper around clock_gettime(). The first parameter of clock_gettime() is of type "clockid_t", not "clock_t". Use the correct type instead. Link: https://lore.kernel.org/r/20250502-selftests-timens-fixes-v1-3-fb517c76f04d@linutronix.de Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2025-05-09	selftests/timens: Make run_tests() functions static	Thomas Weißschuh
	These functions are never used outside their defining compilation unit and can be made static. Link: https://lore.kernel.org/r/20250502-selftests-timens-fixes-v1-2-fb517c76f04d@linutronix.de Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2025-05-09	selftests/timens: Print TAP headers	Thomas Weißschuh
	The TAP specification requires that the output begins with a header line. These header lines are missing in the timens tests. Print such a line. Link: https://lore.kernel.org/r/20250502-selftests-timens-fixes-v1-1-fb517c76f04d@linutronix.de Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2025-05-09	selftests: pid_namespace: add missing sys/mount.h include in pid_max.c	Peter Seiderer
	Fix compile on openSUSE Tumbleweed (gcc-14.2.1, glibc-2.40): - add missing sys/mount.h include Fixes: pid_max.c: In function ‘pid_max_cb’: pid_max.c:42:15: error: implicit declaration of function ‘mount’ [-Wimplicit-function-declaration] 42 | ret = mount("", "/", NULL, MS_PRIVATE | MS_REC, 0); | ^~~~~ Link: https://lore.kernel.org/r/20250115105211.390370-3-ps.report@gmx.net Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: T.J. Mercier <tjmercier@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2025-05-10	Merge tag 'drm-intel-fixes-2025-05-09' of ↵	Dave Airlie
	https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes drm/i915 fixes for v6.15-rc6: - Fix oops on resume after disconnecting DP MST sinks during suspend - Fix SPLC num_waiters refcounting Signed-off-by: Dave Airlie <airlied@redhat.com> From: Jani Nikula <jani.nikula@intel.com> Link: https://lore.kernel.org/r/87tt5umeaw.fsf@intel.com
2025-05-10	Merge tag 'drm-xe-fixes-2025-05-09' of ↵	Dave Airlie
	https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes Driver Changes: - Prevent PF queue overflow - Hold all forcewake during mocs test - Remove GSC flush on reset path - Fix forcewake put on error path - Fix runtime warning when building without svm Signed-off-by: Dave Airlie <airlied@redhat.com> From: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/jffqa56f2zp4i5ztz677cdspgxhnw7qfop3dd3l2epykfpfvza@q2nw6wapsphz
2025-05-09	drm/meson: Use 1000ULL when operating with mode->clock	I Hsin Cheng
	Coverity scan reported the usage of "mode->clock * 1000" may lead to integer overflow. Use "1000ULL" instead of "1000" when utilizing it to avoid potential integer overflow issue. Link: https://scan5.scan.coverity.com/#/project-view/10074/10063?selectedIssue=1646759 Signed-off-by: I Hsin Cheng <richard120310@gmail.com> Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Fixes: 1017560164b6 ("drm/meson: use unsigned long long / Hz for frequency types") Link: https://lore.kernel.org/r/20250505184338.678540-1-richard120310@gmail.com Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
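The overflow described in this commit can be illustrated with a minimal sketch in plain C. The helper names `hz_narrow` and `hz_wide` are hypothetical and are not taken from the driver; unsigned 32-bit arithmetic is used so that the wraparound is well defined, whereas the driver's signed multiplication would be undefined behaviour on overflow.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, not the drm/meson code: convert a pixel clock
 * given in kHz to Hz.
 */

/* 32-bit multiply: the product wraps modulo 2^32 for large clocks. */
static uint32_t hz_narrow(uint32_t clock_khz)
{
	return clock_khz * 1000u;
}

/*
 * Multiplying by 1000ULL promotes the whole expression to a 64-bit
 * type before the multiplication happens, so the product cannot wrap
 * for any 32-bit input.
 */
static uint64_t hz_wide(uint32_t clock_khz)
{
	return clock_khz * 1000ULL;
}
```

For a small clock such as 148500 kHz both helpers agree. For an illustrative 5000000 kHz the narrow version wraps to 705032704 while the wide version returns 5000000000.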
2025-05-09	kselftest: cpufreq: Get rid of double suspend in rtcwake case	Nícolas F. R. A. Prado
	Commit 0b631ed3ce92 ("kselftest: cpufreq: Add RTC wakeup alarm") added support for automatic wakeup in the suspend routine of the cpufreq kselftest by using rtcwake, however it left the manual power state change in the common path. The end result is that when running the cpufreq kselftest with '-t suspend_rtc' or '-t hibernate_rtc', the system will go to sleep and be woken up by the RTC, but then immediately go to sleep again with no wakeup programmed, so it will sleep forever in an automated testing setup. Fix this by moving the manual power state change so that it only happens when not using rtcwake. Link: https://lore.kernel.org/r/20250430-ksft-cpufreq-suspend-rtc-double-fix-v1-1-dc17a729c5a7@collabora.com Fixes: 0b631ed3ce92 ("kselftest: cpufreq: Add RTC wakeup alarm") Signed-off-by: Nícolas F. R. A. Prado <nfraprado@collabora.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2025-05-09	selftests/cpufreq: Fix cpufreq basic read and update testcases	Swapnil Sapkal
	In cpufreq basic selftests, one of the testcases is to read all cpufreq sysfs files and print the values. This testcase assumes all the cpufreq sysfs files have read permissions. However certain cpufreq sysfs files (eg. stats/reset) are write only files and this testcase errors out when it is not able to read the file. Similarly, there is one more testcase which reads the cpufreq sysfs file data and writes it back to the same file. This testcase also errors out for sysfs files without read permission. Fix these testcases by adding proper read permission checks. Link: https://lore.kernel.org/r/20250430171433.10866-1-swapnil.sapkal@amd.com Reported-by: Narasimhan V <narasimhan.v@amd.com> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2025-05-09	selftests/ftrace: Convert poll to a gen_file	Ayush Jain
	Poll program is a helper to ftracetest, thus make it a generic file and remove it from being run as a test. Currently when executing tests using $ make run_tests CC poll TAP version 13 1..2 # timeout set to 0 # selftests: ftrace: poll # Error: Polling file is not specified not ok 1 selftests: ftrace: poll # exit=255 Fix this by using TEST_GEN_FILES to build the 'poll' binary as a helper rather than as a test. Fixes: 80c3e28528ff ("selftests/tracing: Add hist poll() support test") Link: https://lore.kernel.org/r/20250409044632.363285-1-Ayush.jain3@amd.com Signed-off-by: Ayush Jain <Ayush.jain3@amd.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2025-05-09	genirq: Fix inverted condition in handle_nested_irq()	Thomas Gleixner
	Marek reported that the rework of handle_nested_irq() introduced an inverted condition, which prevents handling of interrupts. Fix it up. Fixes: 2ef2e13094c7 ("genirq/chip: Rework handle_nested_irq()") Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Closes: https://lore.kernel/org/all/46ed4040-ca11-4157-8bd7-13c04c113734@samsung.com
2025-05-09	Merge tag 'arm64-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fix from Catalin Marinas: "Move the arm64_use_ng_mappings variable from the .bss to the .data section as it is accessed very early during boot with the MMU off and before the .bss has been initialised. This could lead to incorrect idmap page table" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: cpufeature: Move arm64_use_ng_mappings to the .data section to prevent wrong idmap generation
2025-05-09	Merge tag 'riscv-for-linus-6.15-rc6' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: - The compressed half-word misaligned access instructions (c.lhu, c.lh, and c.sh) from the Zcb extension are now properly emulated - A series of fixes to properly emulate permissions while handling userspace misaligned accesses - A pair of fixes for PR_GET_TAGGED_ADDR_CTRL to avoid accessing the envcfg CSR on systems that don't support that CSR, and to report those failures up to userspace - The .rela.dyn section is no longer stripped from vmlinux, as it is necessary to relocate the kernel under some conditions (including kexec) * tag 'riscv-for-linus-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: Disallow PR_GET_TAGGED_ADDR_CTRL without Supm scripts: Do not strip .rela.dyn section riscv: Fix kernel crash due to PR_SET_TAGGED_ADDR_CTRL riscv: misaligned: use get_user() instead of __get_user() riscv: misaligned: enable IRQs while handling misaligned accesses riscv: misaligned: factorize trap handling riscv: misaligned: Add handling for ZCB instructions
2025-05-09	cgroup/cpuset: Extend kthread_is_per_cpu() check to all PF_NO_SETAFFINITY tasks	Waiman Long
	Commit ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset") enabled us to pull CPUs dedicated to child partitions from tasks in top_cpuset by ignoring per cpu kthreads. However, there can be other kthreads that are not per cpu but have PF_NO_SETAFFINITY flag set to indicate that we shouldn't mess with their CPU affinity. For other kthreads, their affinity will be changed to skip CPUs dedicated to child partitions whether it is an isolating or a scheduling one. As all the per cpu kthreads have PF_NO_SETAFFINITY set, the PF_NO_SETAFFINITY tasks are essentially a superset of per cpu kthreads. Fix this issue by dropping the kthread_is_per_cpu() check and checking the PF_NO_SETAFFINITY flag instead. Fixes: ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset") Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-09	Merge tag 'block-6.15-20250509' of git://git.kernel.dk/linux	Linus Torvalds
	Pull block fixes from Jens Axboe: - Fix for a regression in this series for loop and read/write iterator handling - zone append block update tweak - remove a broken IO priority test - NVMe pull request via Christoph: - unblock ctrl state transition for firmware update (Daniel Wagner) * tag 'block-6.15-20250509' of git://git.kernel.dk/linux: block: remove test of incorrect io priority level nvme: unblock ctrl state transition for firmware update block: only update request sector if needed loop: Add sanity check for read/write_iter
2025-05-09	Merge tag 'io_uring-6.15-20250509' of git://git.kernel.dk/linux	Linus Torvalds
	Pull io_uring fixes from Jens Axboe: - Fix for linked timeouts arming and firing wrt prep and issue of the request being managed by the linked timeout - Fix for a CQE ordering issue between requests with multishot and using the same buffer group. This is a dumbed down version for this release and for stable, it'll get improved for v6.16 - Tweak the SQPOLL submit batch size. A previous commit made SQPOLL manage its own task_work and chose a tiny batch size, bump it from 8 to 32 to fix a performance regression due to that * tag 'io_uring-6.15-20250509' of git://git.kernel.dk/linux: io_uring/sqpoll: Increase task_work submission batch size io_uring: ensure deferred completions are flushed for multishot io_uring: always arm linked timeouts prior to issue
2025-05-09	Merge tag 'modules-6.15-rc6' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux Pull modules fix from Petr Pavlu: "A single fix to prevent use of an uninitialized completion pointer when releasing a module_kobject in specific situations. This addresses a latent bug exposed by commit f95bbfe18512 ("drivers: base: handle module_kobject creation")" * tag 'modules-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: module: ensure that kobject_put() is safe for module type kobjects
2025-05-09	Merge tag 'asahi-soc-fixes-6.15' of https://github.com/AsahiLinux/linux into ↵	Arnd Bergmann
	arm/fixes Apple SoC fixes for 6.15 This tag contains two small commits since rc1: - Add a .mailmap entry requested by Asahi Lina to better filter her emails - Mark the power domains for the touchbar support introduced with 6.15 as always on since the driver cannot initialize the touchbar from scratch after the domains are powered off (e.g. during suspend). * tag 'asahi-soc-fixes-6.15' of https://github.com/AsahiLinux/linux: arm64: dts: apple: touchbar: Mark ps_dispdfr_be as always-on mailmap: Update email for Asahi Lina Link: https://lore.kernel.org/r/20250423145047.3098-1-sven@svenpeter.dev Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-05-09	Merge tag 'riscv-sophgo-dt-fixes-for-v6.15-rc1' of ↵	Arnd Bergmann
	https://github.com/sophgo/linux into arm/fixes RISC-V Sophgo Devicetree fixes for v6.15-rc1 Just one minor fix to correct DMA data-width configuration for CV18xx. Signed-off-by: Chen Wang <unicorn_wang@outlook.com> * tag 'riscv-sophgo-dt-fixes-for-v6.15-rc1' of https://github.com/sophgo/linux: riscv: dts: sophgo: fix DMA data-width configuration for CV18xx Link: https://lore.kernel.org/r/MA0P287MB2262454C19B8899BC1694D04FE832@MA0P287MB2262.INDP287.PROD.OUTLOOK.COM Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-05-09	Merge tag 'amlogic-fixes-for-v6.15' of ↵	Arnd Bergmann
	https://git.kernel.org/pub/scm/linux/kernel/git/amlogic/linux into arm/fixes Amlogic Fixes for v6.15: - fix reference to unknown/untested PWM clock on ARM/ARM64 boards - fix missing clkc_audio node on dreambox ARM64 DT * tag 'amlogic-fixes-for-v6.15' of https://git.kernel.org/pub/scm/linux/kernel/git/amlogic/linux: arm64: dts: amlogic: dreambox: fix missing clkc_audio node arm64: dts: amlogic: g12: fix reference to unknown/untested PWM clock arm64: dts: amlogic: gx: fix reference to unknown/untested PWM clock ARM: dts: amlogic: meson8b: fix reference to unknown/untested PWM clock ARM: dts: amlogic: meson8: fix reference to unknown/untested PWM clock Link: https://lore.kernel.org/r/e9c520a1-b986-49e1-b9b1-67511c187716@linaro.org Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-05-09	Merge tag 'v6.15-rockchip-dtsfixes1' of ↵	Arnd Bergmann
	https://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into arm/fixes Removal of operating-points above what the rk3588j soc is rated for, and a number of smaller fixes: Turing RK1 fan can spin down again, fixed pins, pinmuxing and clocks and some devicetree-correctness improvements. * tag 'v6.15-rockchip-dtsfixes1' of https://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip: arm64: dts: rockchip: fix Sige5 RTC interrupt pin arm64: dts: rockchip: Assign RT5616 MCLK rate on rk3588-friendlyelec-cm3588 arm64: dts: rockchip: Align wifi node name with bindings in CB2 arm64: dts: rockchip: Fix mmc-pwrseq clock name on rock-pi-4 arm64: dts: rockchip: Use "regulator-fixed" for btreg on px30-engicam for vcc3v3-btreg arm64: dts: rockchip: Add pinmuxing for eMMC on QNAP TS433 arm64: dts: rockchip: Remove overdrive-mode OPPs from RK3588J SoC dtsi arm64: dts: rockchip: Allow Turing RK1 cooling fan to spin down Link: https://lore.kernel.org/r/2923598.88bMQJbFj6@diego Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-05-09	x86/mm: Eliminate window where TLB flushes may be inadvertently skipped	Dave Hansen
	tl;dr: There is a window in the mm switching code where the new CR3 is set and the CPU should be getting TLB flushes for the new mm. But should_flush_tlb() has a bug and suppresses the flush. Fix it by widening the window where should_flush_tlb() sends an IPI. Long Version: === History === There were a few things leading up to this. First, updating mm_cpumask() was observed to be too expensive, so it was made lazier. But being lazy caused too many unnecessary IPIs to CPUs due to the now-lazy mm_cpumask(). So code was added to cull mm_cpumask() periodically[2]. But that culling was a bit too aggressive and skipped sending TLB flushes to CPUs that need them. So here we are again. === Problem === The too-aggressive code in should_flush_tlb() strikes in this window: // Turn on IPIs for this CPU/mm combination, but only // if should_flush_tlb() agrees: cpumask_set_cpu(cpu, mm_cpumask(next)); next_tlb_gen = atomic64_read(&next->context.tlb_gen); choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush); load_new_mm_cr3(need_flush); // ^ After 'need_flush' is set to false, IPIs MUST // be sent to this CPU and not be ignored. this_cpu_write(cpu_tlbstate.loaded_mm, next); // ^ Not until this point does should_flush_tlb() // become true! should_flush_tlb() will suppress TLB flushes between load_new_mm_cr3() and writing to 'loaded_mm', which is a window where they should not be suppressed. Whoops. === Solution === Thankfully, the fuzzy "just about to write CR3" window is already marked with loaded_mm==LOADED_MM_SWITCHING. Simply checking for that state in should_flush_tlb() is sufficient to ensure that the CPU is targeted with an IPI. This will cause more TLB flush IPIs. But the window is relatively small and I do not expect this to cause any kind of measurable performance impact. Update the comment where LOADED_MM_SWITCHING is written since it grew yet another user. 
Peter Z also raised a concern that should_flush_tlb() might not observe 'loaded_mm' and 'is_lazy' in the same order that switch_mm_irqs_off() writes them. Add a barrier to ensure that they are observed in the order they are written. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rik van Riel <riel@surriel.com> Link: https://lore.kernel.org/oe-lkp/202411282207.6bd28eae-lkp@intel.com/ [1] Fixes: 6db2526c1d69 ("x86/mm/tlb: Only trim the mm_cpumask once a second") [2] Reported-by: Stephen Dolan <sdolan@janestreet.com> Cc: stable@vger.kernel.org Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
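The decision this commit describes can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct mm_sketch`, `LOADED_MM_SWITCHING_SKETCH`, `should_flush_old()` and `should_flush_new()` are hypothetical names, and the model deliberately leaves out the lazy-TLB check, the cpumask trimming and the memory barrier mentioned in the message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mm_struct; only its address matters. */
struct mm_sketch {
	int id;
};

/* Sentinel meaning "this CPU is in the middle of switching mms". */
static struct mm_sketch switching_marker;
#define LOADED_MM_SWITCHING_SKETCH (&switching_marker)

/* Simplified pre-fix model: flush only once the target mm is recorded. */
static bool should_flush_old(const struct mm_sketch *loaded,
			     const struct mm_sketch *target)
{
	return loaded == target;
}

/*
 * Simplified post-fix model: a CPU that is mid-switch may already be
 * running on the new CR3, so it is targeted as well.
 */
static bool should_flush_new(const struct mm_sketch *loaded,
			     const struct mm_sketch *target)
{
	if (loaded == target)
		return true;
	if (loaded == LOADED_MM_SWITCHING_SKETCH)
		return true;
	return false;
}
```

In this model a CPU whose loaded mm is the switching sentinel is skipped by the old predicate and targeted by the new one, which is the window the commit closes at the cost of some extra IPIs.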
2025-05-09	arm64: dts: imx8mp-var-som: Fix LDO5 shutdown causing SD card timeout	Himanshu Bhavani
	Fix SD card timeout issue caused by LDO5 regulator getting disabled after boot. The kernel log shows LDO5 being disabled, which leads to a timeout on USDHC2: [ 33.760561] LDO5: disabling [ 81.119861] mmc1: Timeout waiting for hardware interrupt. To prevent this, set regulator-boot-on and regulator-always-on for LDO5. Also add the vqmmc regulator to properly support 1.8V/3.3V signaling for USDHC2 using a GPIO-controlled regulator. Fixes: 6c2a1f4f71258 ("arm64: dts: imx8mp-var-som-symphony: Add Variscite Symphony board and VAR-SOM-MX8MP SoM") Signed-off-by: Himanshu Bhavani <himanshu.bhavani@siliconsignals.io> Acked-by: Tarang Raval <tarang.raval@siliconsignals.io> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2025-05-09	io_uring: count allocated requests	Pavel Begunkov
	Keep track of the number of requests a ring currently has allocated (and not freed), it'll be needed in the next patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c8f8308294dc2a1cb8925d984d937d4fc14ab5d4.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09	io_uring: open code io_account_cq_overflow()	Pavel Begunkov
	io_account_cq_overflow() doesn't help explain what's going on in there, and it'll become even smaller with following patches, so open code it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e4333fa0d371f519e52a71148ebdffed4b8d3aa9.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09	io_uring: consolidate drain seq checking	Pavel Begunkov
	We check sequences when queuing drained requests as well as when flushing them. Instead, always queue and immediately try to flush, so that all seq handling can be kept contained in the flushing code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d4651f742e671af5b3216581e539ea5d31bc7125.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09	io_uring: remove drain prealloc checks	Pavel Begunkov
	Currently io_drain_req() has two steps. The first is fast path checking sequence numbers. The second is allocations, rechecking and actual queuing. Further simplify it by removing the first step. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4d06e89ed07611993d7bf89182de2300858379bd.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09	io_uring: simplify drain ret passing	Pavel Begunkov
	"ret" in io_drain_req() is only used in one place, remove it and pass -ENOMEM directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ece724b77e66e6caabcc215e0032ee7ff140f289.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09	io_uring: fix spurious drain flushing	Pavel Begunkov
	io_queue_deferred() is not tolerant to spurious calls not completing some requests. You can have an inflight drain-marked request and another request that came after and got queued into the drain list. Now, if io_queue_deferred() is called before the first request completes, it'll check the 2nd req with req_need_defer(), find that there is no drain flag set, and queue it for execution. To make io_queue_deferred() work, it should at least check sequences for the first request, and then we also need to check if there is another drain request creating another bubble. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/972bde11b7d4ef25b3f5e3fd34f80e4d2aa345b8.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09	io_uring: account drain memory to cgroup	Pavel Begunkov
	Account drain allocations against memcg. It's not a big problem as each such allocation is paired with a request, which is accounted, but it's nicer to follow the limits more closely. Cc: stable@vger.kernel.org # 6.1 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f8dfdbd755c41fd9c75d12b858af07dfba5bbb68.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09	io_uring: add lockdep asserts to io_add_aux_cqe	Pavel Begunkov
	io_add_aux_cqe() can only be called for rings with uring_lock protected completion queues, add a couple of assertions in regards to that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c010eab7b94a187c00a9d46d8b67bf7fcad18af4.1746788592.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09	io_uring/net: move CONFIG_NET guards to Makefile	Pavel Begunkov
	Instruct Makefile to never try to compile net.c without CONFIG_NET and kill ifdefs in the file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f466400e20c3f536191bfd559b1f3cd2a2ab5a1e.1746788579.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09	io_uring: update parameter name in io_pin_pages function declaration	Long Li
	Rename first parameter in io_pin_pages from ubuf to uaddr for consistency between declaration and implementation. Signed-off-by: Long Li <leo.lilong@huawei.com> Link: https://lore.kernel.org/r/20250509063015.3799255-1-leo.lilong@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09	io_uring/sqpoll: Increase task_work submission batch size	Gabriel Krisman Bertazi
	Our QA team reported a 10%-23% throughput reduction on an io_uring sqpoll testcase doing IO to a null_blk, that I traced back to a reduction of the device submission queue depth utilization. It turns out that, after commit af5d68f8892f ("io_uring/sqpoll: manage task_work privately"), we capped the number of task_work entries that can be completed from a single spin of sqpoll to only 8 entries, before the sqpoll goes around to (potentially) sleep. While this cap doesn't drive the submission side directly, it impacts the completion behavior, which affects the number of IO queued by fio per sqpoll cycle on the submission side, and io_uring ends up seeing fewer IOs per sqpoll cycle. As a result, block layer plugging is less effective, and we see more time spent inside the block layer in profiling charts, and increased submission latency measured by fio. There are other places that have increased overhead once sqpoll sleeps more often, such as the sqpoll utilization calculation. But, in this microbenchmark, those were not representative enough in perf charts, and their removal didn't yield measurable changes in throughput. The major overhead comes from the fact we plug less, and less often, when submitting to the block layer. My benchmark is: fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \ --invalidate=1 --time_based --ramp_time=10 --group_reporting=1 \ --filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \ --rw=randread --numjobs=1 --sqthread_poll In one machine, tested on top of Linux 6.15-rc1, we have the following baseline: READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec With this patch: READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec which is a 15% improvement in measured bandwidth. The average submission latency is noticeably lowered too. 
As measured by fio: Baseline: lat (usec): min=20, max=241, avg=99.81, stdev=3.38 Patched: lat (usec): min=26, max=226, avg=86.48, stdev=4.82 If we look at blktrace, we can also see the plugging behavior is improved. In the baseline, we end up limited to plugging 8 requests in the block layer regardless of the device queue depth size, while after patching we can drive more io, and we manage to utilize the full device queue. In the baseline, after a stabilization phase, an ordinary submission looks like: 254,0 1 49942 0.016028795 5977 U N [iou-sqp-5976] 7 After patching, I see consistently more requests per unplug. 254,0 1 4996 0.001432872 3145 U N [iou-sqp-3144] 32 Ideally, the cap size would at least be deep enough to fill the device queue, but we can't predict that behavior, or assume all IO goes to a single device, and thus can't guess the ideal batch size. We also don't want to let the tw run unbounded, though I'm not sure it would really be a problem. Instead, let's just give it a more sensible value that will allow for more efficient batching. I've tested with different cap values, and initially proposed to increase the cap to 1024. Jens argued it is too big of a bump and I observed that, with 32, I'm no longer able to observe this bottleneck in any of my machines. Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20250508181203.3785544-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
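The effect of the per-spin cap can be sketched with a toy model in C. The function names `run_capped()` and `spins_to_drain()` are hypothetical and this is not the sqpoll code; the model only counts how many passes are needed to work through a backlog when each pass may complete at most `cap` entries.

```c
/*
 * Toy model of a capped work loop: each pass completes at most 'cap'
 * entries from the backlog. Names are illustrative only.
 */
static unsigned int run_capped(unsigned int *pending, unsigned int cap)
{
	unsigned int done = (*pending < cap) ? *pending : cap;

	*pending -= done;
	return done;
}

/* Count the passes needed to drain 'pending' entries; 0 if cap is 0. */
static unsigned int spins_to_drain(unsigned int pending, unsigned int cap)
{
	unsigned int spins = 0;

	if (cap == 0)
		return 0;

	while (pending > 0) {
		run_capped(&pending, cap);
		spins++;
	}
	return spins;
}
```

With a backlog of 128 entries, matching the `--iodepth=128` of the benchmark above, a cap of 8 needs 16 passes in this model while a cap of 32 needs 4. The model says nothing about plugging or latency; it only shows why a small cap means many more trips around the loop.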
2025-05-09	Merge branch 'net_sched-gso_skb-flushing'	David S. Miller
	Cong Wang says: ==================== net_sched: Fix gso_skb flushing during qdisc change This patchset contains a bug fix and its test cases, please check each patch description for more details. To keep the bug fix minimum, I intentionally limit the code changes to the cases reported here. --- v2: added a missing qlen-- fixed the new boolean parameter for two qdiscs ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-09	selftests/tc-testing: Add qdisc limit trimming tests	Cong Wang
	Added new test cases for FQ, FQ_CODEL, FQ_PIE, and HHF qdiscs to verify queue trimming behavior when the qdisc limit is dynamically reduced. Each test injects packets, reduces the qdisc limit, and checks that the new limit is enforced. This is still best effort since timing qdisc backlog is not easy. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-09	net_sched: Flush gso_skb list too during ->change()	Cong Wang
	Previously, when reducing a qdisc's limit via the ->change() operation, only the main skb queue was trimmed, potentially leaving packets in the gso_skb list. This could result in NULL pointer dereference when we only check sch->limit against sch->q.qlen. This patch introduces a new helper, qdisc_dequeue_internal(), which ensures both the gso_skb list and the main queue are properly flushed when trimming excess packets. All relevant qdiscs (codel, fq, fq_codel, fq_pie, hhf, pie) are updated to use this helper in their ->change() routines. Fixes: 76e3cc126bb2 ("codel: Controlled Delay AQM") Fixes: 4b549a2ef4be ("fq_codel: Fair Queue Codel AQM") Fixes: afe4fd062416 ("pkt_sched: fq: Fair Queue packet scheduler") Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler") Fixes: 10239edf86f1 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc") Fixes: d4b36210c2e6 ("net: pkt_sched: PIE AQM scheme") Reported-by: Will <willsroot@protonmail.com> Reported-by: Savy <savy@syst3mfailure.io> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
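The trimming rule in this commit can be sketched with a toy queue in C. Everything here is hypothetical (`struct toy_qdisc`, `toy_dequeue_internal()`, `toy_change_limit()`); packets are reduced to counters, and the sketch assumes, as the message implies, that the queue length being compared against the limit covers both the requeue list and the main queue.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy qdisc: packets are just counted, not stored. */
struct toy_qdisc {
	size_t gso_len;   /* packets parked on the gso_skb-like list */
	size_t main_len;  /* packets in the main queue */
	size_t limit;     /* maximum total queue length */
};

static size_t toy_qlen(const struct toy_qdisc *q)
{
	return q->gso_len + q->main_len;
}

/* Dequeue from the main queue only; false plays the role of NULL. */
static bool toy_dequeue_main(struct toy_qdisc *q)
{
	if (q->main_len == 0)
		return false;
	q->main_len--;
	return true;
}

/* Drain the parked list first, then fall back to the main queue. */
static bool toy_dequeue_internal(struct toy_qdisc *q)
{
	if (q->gso_len > 0) {
		q->gso_len--;
		return true;
	}
	return toy_dequeue_main(q);
}

/* Apply a new limit and drop packets until it holds; returns drops. */
static size_t toy_change_limit(struct toy_qdisc *q, size_t new_limit)
{
	size_t dropped = 0;

	q->limit = new_limit;
	while (toy_qlen(q) > q->limit) {
		if (!toy_dequeue_internal(q))
			break;
		dropped++;
	}
	return dropped;
}
```

If the loop called `toy_dequeue_main()` instead, a queue holding only parked packets would return "nothing to dequeue" while the length still exceeded the limit, which is the situation the commit describes as leading to a NULL pointer dereference.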
2025-05-09	Merge patch series "Minor namespace code simplication"	Christian Brauner
	Joel Savitz <jsavitz@redhat.com> says: The two patches are independent of each other. The first patch removes unnecessary NULL guards from free_nsproxy() and create_new_namespaces() in line with other usage of the put_*_ns() call sites. The second patch slightly reduces the size of the kernel when CONFIG_CGROUPS is not selected. patches from https://lore.kernel.org/20250508184930.183040-1-jsavitz@redhat.com: include/cgroup: separate {get,put}_cgroup_ns no-op case kernel/nsproxy: remove unnecessary guards Link: https://lore.kernel.org/20250508184930.183040-1-jsavitz@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09	include/cgroup: separate {get,put}_cgroup_ns no-op case	Joel Savitz
	When CONFIG_CGROUPS is not selected, {get,put}_cgroup_ns become no-ops and therefore it is not necessary to compile in the code for changing the reference count. When CONFIG_CGROUPS is selected, there is no valid case where either of {get,put}_cgroup_ns() will be called with a NULL argument. Signed-off-by: Joel Savitz <jsavitz@redhat.com> Link: https://lore.kernel.org/20250508184930.183040-3-jsavitz@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09	kernel/nsproxy: remove unnecessary guards	Joel Savitz
	In free_nsproxy() and the error path of create_new_namespaces() the put_*_ns() calls are guarded by unnecessary NULL checks. put_pid_ns(), put_ipc_ns(), put_uts_ns(), and put_time_ns() will never receive a NULL argument unless their namespace type is disabled, and in this case all four become no-ops at compile time anyway. put_mnt_ns() will never receive a null argument at any time. This unguarded usage is in line with other call sites of put_*_ns(). Signed-off-by: Joel Savitz <jsavitz@redhat.com> Link: https://lore.kernel.org/20250508184930.183040-2-jsavitz@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09	f2fs: fix freezing filesystem during resize	Christian Brauner
	Using FREEZE_HOLDER_USERSPACE has two consequences: (1) If userspace freezes the filesystem after mnt_drop_write_file() but before freeze_super() was called filesystem resizing will fail because the freeze isn't marked as nestable. (2) If the kernel has successfully frozen the filesystem via FREEZE_HOLDER_USERSPACE userspace can simply undo it by using the FITHAW ioctl. Fix both issues by using FREEZE_HOLDER_KERNEL. It will nest with FREEZE_HOLDER_USERSPACE and cannot be undone by userspace. And it is the correct thing to do because the kernel temporarily freezes the filesystem. Signed-off-by: Christian Brauner <brauner@kernel.org>
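The two consequences listed in this commit can be illustrated with a toy model of per-holder freeze counts. This is a sketch under assumptions, not the VFS implementation: `struct sb_sketch`, `sb_freeze()`, `sb_thaw()` and the holder enum are hypothetical, and the model assumes that each holder type has its own count and that a second non-nestable freeze by the same holder type is refused.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical holder types mirroring the kernel/userspace split. */
enum holder_sketch { HOLDER_KERNEL, HOLDER_USERSPACE };

/* Toy superblock: one freeze count per holder type. */
struct sb_sketch {
	unsigned int kcount;
	unsigned int ucount;
};

static bool sb_frozen(const struct sb_sketch *sb)
{
	return sb->kcount > 0 || sb->ucount > 0;
}

/* Non-nestable freeze: refused if this holder type already holds one. */
static int sb_freeze(struct sb_sketch *sb, enum holder_sketch who)
{
	unsigned int *count = (who == HOLDER_KERNEL) ? &sb->kcount : &sb->ucount;

	if (*count > 0)
		return -EBUSY;
	(*count)++;
	return 0;
}

/* A thaw only releases a freeze held by the same holder type. */
static int sb_thaw(struct sb_sketch *sb, enum holder_sketch who)
{
	unsigned int *count = (who == HOLDER_KERNEL) ? &sb->kcount : &sb->ucount;

	if (*count == 0)
		return -EINVAL;
	(*count)--;
	return 0;
}
```

In this model a resize that freezes as the userspace holder fails when userspace got there first, and a userspace thaw can release it. Freezing as the kernel holder succeeds alongside an existing userspace freeze, and a userspace thaw leaves the filesystem frozen until the kernel holder thaws.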
2025-05-09	Merge patch series "power: wire-up filesystem freeze/thaw with suspend/resume"	Christian Brauner
	Christian Brauner <brauner@kernel.org> says: Now all the pieces are in place to actually allow the power subsystem to freeze/thaw filesystems during suspend/resume. Filesystems are only frozen and thawed if the power subsystem does actually own the freeze. Otherwise it risks thawing filesystems it didn't own. This could be done differently by, e.g., keeping the filesystems that were actually frozen on a list and then unfreezing them from that list. This is disgustingly unclean though and reeks of an ugly hack. If the filesystem is already frozen by the time we've frozen all userspace processes we don't care to freeze it again. That's userspace's job once the process resumes. We only actually freeze filesystems if we absolutely have to and we ignore other failures to freeze. We could bubble up errors and fail suspend/resume if the error isn't EBUSY (aka it's already frozen) but I don't think that this is worth it. Filesystem freezing during suspend/resume is best-effort. If the user has 500 ext4 filesystems mounted and 4 fail to freeze for whatever reason then we simply skip them. What we have now is already a big improvement and let's see how we fare with it before making our lives even harder (and uglier) than we have to. * patches from https://lore.kernel.org/r/20250402-work-freeze-v2-0-6719a97b52ac@kernel.org: kernfs: add warning about implementing freeze/thaw power: freeze filesystems during suspend/resume Link: https://lore.kernel.org/r/20250402-work-freeze-v2-0-6719a97b52ac@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09	Merge patch series "efivarfs: support freeze/thaw"	Christian Brauner
	Christian Brauner <brauner@kernel.org> says: Allow efivarfs to take part in resyncing variable state during system hibernation and suspend. Add freeze/thaw support. This is a pretty straightforward implementation. We simply add regular freeze/thaw support for both userspace and the kernel. This works without any big issues and, congrats, afaict efivars is the first pseudofilesystem that adds support for filesystem freezing and thawing. The simplicity comes from the fact that we simply always resync variable state after efivarfs has been frozen. It doesn't matter whether that's because of suspend, userspace initiated freeze or hibernation. Efivars is simple enough that it doesn't matter that we walk all dentries. There are no directories and there aren't insane amounts of entries and both freeze/thaw are already heavy-handed operations. If userspace initiated a freeze/thaw cycle they would need CAP_SYS_ADMIN in the initial user namespace (as that's where efivarfs is mounted) so it can't be triggered by random userspace. IOW, we really really don't care. * patches from https://lore.kernel.org/r/20250331-work-freeze-v1-0-6dfbe8253b9f@kernel.org: efivarfs: support freeze/thaw libfs: export find_next_child() Link: https://lore.kernel.org/r/20250331-work-freeze-v1-0-6dfbe8253b9f@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09	kernfs: add warning about implementing freeze/thaw	Christian Brauner
	Sysfs is built on top of kernfs and sysfs provides the power management infrastructure to support suspend/hibernate by writing to various files in /sys/power/. As filesystems may be automatically frozen during suspend/hibernate implementing freeze/thaw support for kernfs generically will cause deadlocks as the suspending/hibernation initiating task will hold a VFS lock that it will then wait upon to be released. If freeze/thaw for kernfs is needed talk to the VFS. Link: https://lore.kernel.org/r/20250402-work-freeze-v2-4-6719a97b52ac@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09	efivarfs: support freeze/thaw	Christian Brauner
	Allow efivarfs to take part in resyncing variable state during system hibernation and suspend. Add freeze/thaw support. This is a pretty straightforward implementation. We simply add regular freeze/thaw support for both userspace and the kernel. This works without any big issues and, congrats, afaict efivars is the first pseudofilesystem that adds support for filesystem freezing and thawing. The simplicity comes from the fact that we simply always resync variable state after efivarfs has been frozen. It doesn't matter whether that's because of suspend, userspace initiated freeze or hibernation. Efivars is simple enough that it doesn't matter that we walk all dentries. There are no directories and there aren't insane amounts of entries and both freeze/thaw are already heavy-handed operations. We really really don't need to care. Link: https://lore.kernel.org/r/20250331-work-freeze-v1-2-6dfbe8253b9f@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09	power: freeze filesystems during suspend/resume	Christian Brauner
	Now all the pieces are in place to actually allow the power subsystem to freeze/thaw filesystems during suspend/resume. Filesystems are only frozen and thawed if the power subsystem does actually own the freeze. We could bubble up errors and fail suspend/resume if the error isn't EBUSY (aka it's already frozen) but I don't think that this is worth it. Filesystem freezing during suspend/resume is best-effort. If the user has 500 ext4 filesystems mounted and 4 fail to freeze for whatever reason then we simply skip them. What we have now is already a big improvement and let's see how we fare with it before making our lives even harder (and uglier) than we have to. We add a new sysctl knob /sys/power/freeze_filesystems that will allow userspace to freeze filesystems during suspend/hibernate. For now it defaults to off. The thaw logic doesn't require checking whether freezing is enabled because the power subsystem exclusively owns frozen filesystems for the duration of suspend/hibernate and is able to skip filesystems it doesn't need to freeze. Also it is technically possible that filesystem_freeze_enabled is true and power freezes the filesystems but before freezing all processes another process disables filesystem_freeze_enabled. If power were to place the filesystems_thaw() call under filesystem_freeze_enabled it would fail to thaw the filesystems it froze. The exclusive holder mechanism makes it possible to iterate through the list without any concern, making sure that no filesystems are left frozen. Link: https://lore.kernel.org/r/20250402-work-freeze-v2-3-6719a97b52ac@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
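The race described above, where the knob is switched off between freeze and thaw, can be illustrated with a small toy model. Everything here is hypothetical: `toy_freeze_enabled` merely stands in for the knob, and the `toy_` functions are not the kernel's helpers. The gated variant exists only to show the bug the commit avoids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy filesystem: only tracks whether the power pass froze it. */
struct toy_fs {
	bool frozen_by_power;
};

/* Hypothetical stand-in for the userspace-controlled knob. */
static bool toy_freeze_enabled;

/* Suspend path: freeze only if the knob is on at this moment. */
static void toy_suspend_freeze(struct toy_fs *fs, size_t n)
{
	assert(fs != NULL || n == 0);
	if (!toy_freeze_enabled)
		return;
	for (size_t i = 0; i < n; i++)
		fs[i].frozen_by_power = true;
}

/* Resume path, ungated: thaw whatever power owns, regardless of the knob. */
static void toy_resume_thaw(struct toy_fs *fs, size_t n)
{
	assert(fs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		fs[i].frozen_by_power = false;
}

/* Buggy variant: gating the thaw on the knob can leave filesystems frozen. */
static void toy_resume_thaw_gated(struct toy_fs *fs, size_t n)
{
	if (!toy_freeze_enabled)
		return;
	toy_resume_thaw(fs, n);
}

/* Count filesystems that are still frozen by the power pass. */
static size_t toy_count_frozen(const struct toy_fs *fs, size_t n)
{
	size_t frozen = 0;

	assert(fs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (fs[i].frozen_by_power)
			frozen++;
	return frozen;
}
```

If the knob is cleared after toy_suspend_freeze() has run, toy_resume_thaw_gated() returns early and every filesystem stays frozen, while toy_resume_thaw() clears them all.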
2025-05-09	libfs: export find_next_child()	Christian Brauner
	Export find_next_child() so it can be used by efivarfs. Keep it internal for now. There's no reason to advertise this kernel-wide. Link: https://lore.kernel.org/r/20250331-work-freeze-v1-1-6dfbe8253b9f@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09	Merge patch series "Extend freeze support to suspend and hibernate"	Christian Brauner
	Christian Brauner <brauner@kernel.org> says: Add the necessary infrastructure changes to support freezing for suspend and hibernate. This should be all that's needed to wire up power. * patches from https://lore.kernel.org/r/20250329-work-freeze-v2-0-a47af37ecc3d@kernel.org: super: add filesystem freezing helpers for suspend and hibernate gfs2: pass through holder from the VFS for freeze/thaw super: use common iterator (Part 2) super: use a common iterator (Part 1) super: skip dying superblocks early super: simplify user_get_super() super: remove pointless s_root checks Link: https://lore.kernel.org/r/20250329-work-freeze-v2-0-a47af37ecc3d@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09	super: add filesystem freezing helpers for suspend and hibernate	Christian Brauner
	Allow the power subsystem to support filesystem freeze for suspend and hibernate. For some kernel subsystems it is paramount that they are guaranteed that they are the owner of the freeze to avoid any risk of deadlocks. This is the case for the power subsystem. Enable it to recognize whether it did actually freeze the filesystem. If userspace has 10 filesystems and suspend/hibernate manages to freeze 5 and then fails on the 6th for whatever odd reason (current or future) then power needs to undo the freeze of the first 5 filesystems. It can't just walk the list again because while it's unlikely that a new filesystem got added in the meantime it still cannot tell which filesystems the power subsystem actually managed to get a freeze reference count on that needs to be dropped during thaw. There are various ways out of this ugliness. For example, record the filesystems the power subsystem managed to freeze on a temporary list in the callbacks and then walk that list backwards during thaw to undo the freezing, or make sure that the power subsystem just actually exclusively freezes things it can freeze and mark such filesystems as being owned by power for the duration of the suspend or resume cycle. I opted for the latter as that seemed the clean thing to do even if it means more code changes. If hibernation races with filesystem freezing (e.g. DM reconfiguration), then hibernation need not freeze a filesystem because it's already frozen but userspace may thaw the filesystem before hibernation actually happens. If the race happens the other way around, DM reconfiguration may unexpectedly fail with EBUSY. So allow FREEZE_EXCL to nest with other holders. An exclusive freezer cannot be undone by any of the other concurrent freezers. Link: https://lore.kernel.org/r/20250329-work-freeze-v2-6-a47af37ecc3d@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
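The ownership-marking approach chosen above can be sketched as a toy model: each filesystem records whether the power pass holds its own freeze reference, so a failed pass can be undone by walking the same list again. All `toy_` names and the `fail_freeze` hook are hypothetical and exist only for illustration. The rollback shown models the partial-failure scenario from the text, not necessarily what the final power code does on error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; not the kernel's struct super_block. */
struct toy_fs {
	int  freeze_count;   /* active freeze references from all holders */
	bool owned_by_power; /* set only while power holds its own reference */
	bool fail_freeze;    /* illustration hook: simulate a freeze failure */
};

/* Take power's exclusive freeze reference. Returns 0 on success. */
static int toy_power_freeze(struct toy_fs *fs)
{
	assert(fs != NULL);
	if (fs->fail_freeze)
		return -1;
	fs->freeze_count++; /* nests with any existing holders */
	fs->owned_by_power = true;
	return 0;
}

/* Drop only the references power itself took; leave the others alone. */
static void toy_power_thaw_all(struct toy_fs *fs, size_t n)
{
	assert(fs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!fs[i].owned_by_power)
			continue;
		fs[i].freeze_count--;
		fs[i].owned_by_power = false;
	}
}

/* Freeze every filesystem; on the first failure undo what power froze. */
static int toy_power_freeze_all(struct toy_fs *fs, size_t n)
{
	assert(fs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (toy_power_freeze(&fs[i]) != 0) {
			toy_power_thaw_all(fs, n);
			return -1;
		}
	}
	return 0;
}
```

Because the ownership mark lives on the filesystem itself, the undo step needs no temporary list, and a filesystem that another holder froze earlier keeps that holder's reference after power thaws.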
2025-05-09	fs: use writeback_iter directly in mpage_writepages	Christoph Hellwig
	Stop using write_cache_pages and use writeback_iter directly. This removes an indirect call per written folio and makes the code easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250507062124.3933305-1-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christian Brauner <brauner@kernel.org>