Age | Commit message | Author
2025-05-09 | io_uring: open code io_account_cq_overflow() | Pavel Begunkov
io_account_cq_overflow() doesn't help explaining what's going on in there, and it'll become even smaller with following patches, so open code it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e4333fa0d371f519e52a71148ebdffed4b8d3aa9.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 | io_uring: consolidate drain seq checking | Pavel Begunkov
We check sequences when queuing drained requests as well as when flushing them. Instead, always queue and immediately try to flush, so that all seq handling can be kept contained in the flushing code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d4651f742e671af5b3216581e539ea5d31bc7125.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 | io_uring: remove drain prealloc checks | Pavel Begunkov
Currently io_drain_req() has two steps. The first is fast path checking sequence numbers. The second is allocations, rechecking and actual queuing. Further simplify it by removing the first step. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4d06e89ed07611993d7bf89182de2300858379bd.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 | io_uring: simplify drain ret passing | Pavel Begunkov
"ret" in io_drain_req() is only used in one place, remove it and pass -ENOMEM directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ece724b77e66e6caabcc215e0032ee7ff140f289.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 | io_uring: fix spurious drain flushing | Pavel Begunkov
io_queue_deferred() is not tolerant to spurious calls not completing some requests. You can have an inflight drain-marked request and another request that came after and got queued into the drain list. Now, if io_queue_deferred() is called before the first request completes, it'll check the 2nd req with req_need_defer(), find that there is no drain flag set, and queue it for execution. To make io_queue_deferred() work, it should at least check sequences for the first request, and then we also need to check if there is another drain request creating another bubble. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/972bde11b7d4ef25b3f5e3fd34f80e4d2aa345b8.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
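The flush rule described in this message can be illustrated with a toy model. This is only a sketch under stated assumptions, not the kernel code: the `Deferred` record, the `flush_deferred()` function, and the idea of comparing a per-request sequence number against a count of completed requests are all hypothetical simplifications chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Deferred:
    """Hypothetical stand-in for an entry on the drain (defer) list."""
    name: str
    seq: int     # number of requests submitted before this one
    drain: bool  # True if the request carries the drain flag


def flush_deferred(deferred, completed):
    """Pop and return the entries that may run now.

    `completed` is the number of requests that have finished so far.
    The sequence check is applied to the first entry and to every later
    drain-marked entry, since each of those opens a new "bubble".
    """
    runnable = []
    first = True
    while deferred:
        head = deferred[0]
        if (first or head.drain) and completed < head.seq:
            break  # something submitted earlier is still in flight
        runnable.append(deferred.popleft())
        first = False
    return runnable
```

In this model a spurious call with `completed=0` leaves a non-drain request with `seq=1` on the list, because the first entry is always sequence-checked regardless of its flags.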
2025-05-09 | io_uring: account drain memory to cgroup | Pavel Begunkov
Account drain allocations against memcg. It's not a big problem as each such allocation is paired with a request, which is accounted, but it's nicer to follow the limits more closely. Cc: stable@vger.kernel.org # 6.1 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f8dfdbd755c41fd9c75d12b858af07dfba5bbb68.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 | PM: sysfs: Move debug runtime PM attributes to runtime_attrs[] | Rafael J. Wysocki
Some of the debug sysfs attributes for runtime PM are located in the power_attrs[] table, so they are exposed even in the pm_runtime_has_no_callbacks() case, unlike the other non-debug sysfs attributes for runtime PM, which may be confusing. Moreover, dev_attr_runtime_status.attr appears in two places, which effectively causes it to be always exposed if CONFIG_PM_ADVANCED_DEBUG is set, but otherwise it is exposed only when pm_runtime_has_no_callbacks() returns 'false'. Address this by putting all sysfs attributes for runtime PM into runtime_attrs[]. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/12677254.O9o76ZdvQC@rjwysocki.net
2025-05-09 | PM: hibernate: add configurable delay for pm_test | Zihuan Zhang
Turn the default 5 second test delay for hibernation into a configurable module parameter, so users can determine how long to wait in this pseudo-hibernate state before resuming the system. The configurable delay parameter has been added for suspend, so add an analogous one for hibernation.
Example (wait 30 seconds):
    # echo 30 > /sys/module/hibernate/parameters/pm_test_delay
    # echo core > /sys/power/pm_test
Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20250507063520.419635-1-zhangzihuan@kylinos.cn [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-09 | io_uring: add lockdep asserts to io_add_aux_cqe | Pavel Begunkov
io_add_aux_cqe() can only be called for rings with uring_lock protected completion queues, add a couple of assertions in regards to that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c010eab7b94a187c00a9d46d8b67bf7fcad18af4.1746788592.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 | io_uring/net: move CONFIG_NET guards to Makefile | Pavel Begunkov
Instruct Makefile to never try to compile net.c without CONFIG_NET and kill ifdefs in the file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f466400e20c3f536191bfd559b1f3cd2a2ab5a1e.1746788579.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 | io_uring: update parameter name in io_pin_pages function declaration | Long Li
Rename first parameter in io_pin_pages from ubuf to uaddr for consistency between declaration and implementation. Signed-off-by: Long Li <leo.lilong@huawei.com> Link: https://lore.kernel.org/r/20250509063015.3799255-1-leo.lilong@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 | io_uring/sqpoll: Increase task_work submission batch size | Gabriel Krisman Bertazi
Our QA team reported a 10%-23% throughput reduction on an io_uring sqpoll testcase doing IO to a null_blk, that I traced back to a reduction of the device submission queue depth utilization. It turns out that, after commit af5d68f8892f ("io_uring/sqpoll: manage task_work privately"), we capped the number of task_work entries that can be completed from a single spin of sqpoll to only 8 entries, before the sqpoll goes around to (potentially) sleep. While this cap doesn't drive the submission side directly, it impacts the completion behavior, which affects the number of IO queued by fio per sqpoll cycle on the submission side, and io_uring ends up seeing fewer IOs per sqpoll cycle. As a result, block layer plugging is less effective, and we see more time spent inside the block layer in profiling charts, and increased submission latency measured by fio.

There are other places that have increased overhead once sqpoll sleeps more often, such as the sqpoll utilization calculation. But, in this microbenchmark, those were not representative enough in perf charts, and their removal didn't yield measurable changes in throughput. The major overhead comes from the fact we plug less, and less often, when submitting to the block layer.

My benchmark is:
    fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
        --invalidate=1 --time_based --ramp_time=10 --group_reporting=1 \
        --filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
        --rw=randread --numjobs=1 --sqthread_poll

On one machine, tested on top of Linux 6.15-rc1, we have the following baseline:
    READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec
With this patch:
    READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec
which is a 15% improvement in measured bandwidth.

The average submission latency is noticeably lowered too. As measured by fio:
    Baseline: lat (usec): min=20, max=241, avg=99.81, stdev=3.38
    Patched:  lat (usec): min=26, max=226, avg=86.48, stdev=4.82

If we look at blktrace, we can also see the plugging behavior is improved. In the baseline, we end up limited to plugging 8 requests in the block layer regardless of the device queue depth size, while after patching we can drive more io, and we manage to utilize the full device queue. In the baseline, after a stabilization phase, an ordinary submission looks like:
    254,0 1 49942 0.016028795 5977 U N [iou-sqp-5976] 7
After patching, I see consistently more requests per unplug.
    254,0 1 4996 0.001432872 3145 U N [iou-sqp-3144] 32

Ideally, the cap size would at least be deep enough to fill the device queue, but we can't predict that behavior, or assume all IO goes to a single device, and thus can't guess the ideal batch size. We also don't want to let the tw run unbounded, though I'm not sure it would really be a problem. Instead, let's just give it a more sensible value that will allow for more efficient batching. I've tested with different cap values, and initially proposed to increase the cap to 1024. Jens argued it is too big of a bump and I observed that, with 32, I'm no longer able to observe this bottleneck in any of my machines.

Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20250508181203.3785544-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
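The figures quoted in this message can be rechecked with a few lines of arithmetic. The sketch below uses only numbers from the message (4994 and 5762 MiB/s, 99.81 and 86.48 usec, caps of 8 and 32, iodepth 128); the helper names are hypothetical, and treating all 128 in-flight completions as pending at once is an illustrative assumption, not something the message states.

```python
import math


def relative_change(baseline, patched):
    """Fractional change from baseline to patched (positive = increase)."""
    return (patched - baseline) / baseline


def passes_needed(pending, cap):
    """Number of capped task_work passes needed to retire `pending` entries."""
    return math.ceil(pending / cap)


bandwidth_gain = relative_change(4994, 5762)   # MiB/s, from the fio output
latency_change = relative_change(99.81, 86.48) # usec, average latency

print(f"bandwidth: {bandwidth_gain * 100:+.1f}%")
print(f"avg latency: {latency_change * 100:+.1f}%")
print("passes at cap 8:", passes_needed(128, 8))
print("passes at cap 32:", passes_needed(128, 32))
```

Under that assumption, a cap of 8 needs four times as many passes as a cap of 32 to retire the same backlog, which is consistent with the smaller unplug batches seen in the baseline blktrace line.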
2025-05-09 | PM: wakeup: Delete space in the end of string shown by pm_show_wakelocks() | Zijun Hu
pm_show_wakelocks() is called to generate a string when showing attributes /sys/power/wake_(lock|unlock), but the string ends with an unwanted space that was added back by mistake by commit c9d967b2ce40 ("PM: wakeup: simplify the output logic of pm_show_wakelocks()"). Remove the unwanted space. Fixes: c9d967b2ce40 ("PM: wakeup: simplify the output logic of pm_show_wakelocks()") Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Link: https://patch.msgid.link/20250505-fix_power-v1-1-0f7f2c2f338c@quicinc.com [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-09 | PM: wakeup: Add missing wakeup source attribute relax_count | Zijun Hu
There is wakeup source attribute 'active_count', but its counterpart attribute 'relax_count' is missing. Add 'relax_count' for consistency. Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Link: https://patch.msgid.link/20250505-add_power_attrs-v1-1-10bc3c73c320@quicinc.com [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-09 | Merge tag 'amd-pstate-v6.16-2025-05-08' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux | Rafael J. Wysocki
Merge amd-pstate content for 6.16 (5/8/25) from Mario Limonciello:
- Add support for a new feature on some BIOS that allows setting "lowest CPU minimum frequency".
- Fix the amd-pstate-ut unit tests to restore system settings when done.
* tag 'amd-pstate-v6.16-2025-05-08' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux:
  amd-pstate-ut: Reset amd-pstate driver mode after running selftests
  cpufreq/amd-pstate: Add support for the "Requested CPU Min frequency" BIOS option
  cpufreq/amd-pstate: Add offline, online and suspend callbacks for amd_pstate_driver
  cpufreq/amd-pstate: Move max_perf limiting in amd_pstate_update
2025-05-09 | Merge branch 'net_sched-gso_skb-flushing' | David S. Miller
Cong Wang says:
====================
net_sched: Fix gso_skb flushing during qdisc change
This patchset contains a bug fix and its test cases, please check each patch description for more details. To keep the bug fix minimum, I intentionally limit the code changes to the cases reported here.
---
v2: added a missing qlen--
    fixed the new boolean parameter for two qdiscs
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-09 | selftests/tc-testing: Add qdisc limit trimming tests | Cong Wang
Added new test cases for FQ, FQ_CODEL, FQ_PIE, and HHF qdiscs to verify queue trimming behavior when the qdisc limit is dynamically reduced. Each test injects packets, reduces the qdisc limit, and checks that the new limit is enforced. This is still best effort since timing qdisc backlog is not easy. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-09 | net_sched: Flush gso_skb list too during ->change() | Cong Wang
Previously, when reducing a qdisc's limit via the ->change() operation, only the main skb queue was trimmed, potentially leaving packets in the gso_skb list. This could result in NULL pointer dereference when we only check sch->limit against sch->q.qlen. This patch introduces a new helper, qdisc_dequeue_internal(), which ensures both the gso_skb list and the main queue are properly flushed when trimming excess packets. All relevant qdiscs (codel, fq, fq_codel, fq_pie, hhf, pie) are updated to use this helper in their ->change() routines. Fixes: 76e3cc126bb2 ("codel: Controlled Delay AQM") Fixes: 4b549a2ef4be ("fq_codel: Fair Queue Codel AQM") Fixes: afe4fd062416 ("pkt_sched: fq: Fair Queue packet scheduler") Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler") Fixes: 10239edf86f1 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc") Fixes: d4b36210c2e6 ("net: pkt_sched: PIE AQM scheme") Reported-by: Will <willsroot@protonmail.com> Reported-by: Savy <savy@syst3mfailure.io> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
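The trimming behavior described here can be sketched as a toy model. Everything below is hypothetical and simplified: `ToyQdisc`, its two deques, and the assumption that the queue length counts packets on both lists are illustrative choices, and `dequeue_internal()` is only a loose analogue of the qdisc_dequeue_internal() helper named in the message.

```python
from collections import deque


class ToyQdisc:
    """Toy queue with a side list of requeued packets and a main queue."""

    def __init__(self, limit):
        self.limit = limit
        self.gso_skb = deque()  # requeued packets, served first
        self.queue = deque()    # main packet queue

    @property
    def qlen(self):
        # Assumption: the length counter covers both lists.
        return len(self.gso_skb) + len(self.queue)

    def dequeue_internal(self):
        """Take from the side list first, then from the main queue."""
        if self.gso_skb:
            return self.gso_skb.popleft()
        if self.queue:
            return self.queue.popleft()
        return None

    def change_limit(self, new_limit):
        """Lower the limit and drop packets until qlen fits it."""
        self.limit = new_limit
        dropped = []
        while self.qlen > self.limit:
            pkt = self.dequeue_internal()
            if pkt is None:  # defensive: nothing left on either list
                break
            dropped.append(pkt)
        return dropped
```

A trim loop that looked only at the main queue could find it empty while the length counter still exceeds the limit; draining both lists through one helper avoids that mismatch.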
2025-05-09 | Merge patch series "Minor namespace code simplication" | Christian Brauner
Joel Savitz <jsavitz@redhat.com> says: The two patches are independent of each other. The first patch removes unnecessary NULL guards from free_nsproxy() and create_new_namespaces() in line with other usage of the put_*_ns() call sites. The second patch slightly reduces the size of the kernel when CONFIG_CGROUPS is not selected. * patches from https://lore.kernel.org/20250508184930.183040-1-jsavitz@redhat.com: include/cgroup: separate {get,put}_cgroup_ns no-op case kernel/nsproxy: remove unnecessary guards Link: https://lore.kernel.org/20250508184930.183040-1-jsavitz@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | include/cgroup: separate {get,put}_cgroup_ns no-op case | Joel Savitz
When CONFIG_CGROUPS is not selected, {get,put}_cgroup_ns become no-ops and therefore it is not necessary to compile in the code for changing the reference count. When CONFIG_CGROUPS is selected, there is no valid case where either of {get,put}_cgroup_ns() will be called with a NULL argument. Signed-off-by: Joel Savitz <jsavitz@redhat.com> Link: https://lore.kernel.org/20250508184930.183040-3-jsavitz@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | kernel/nsproxy: remove unnecessary guards | Joel Savitz
In free_nsproxy() and the error path of create_new_namespaces() the put_*_ns() calls are guarded by unnecessary NULL checks. put_pid_ns(), put_ipc_ns(), put_uts_ns(), and put_time_ns() will never receive a NULL argument unless their namespace type is disabled, and in this case all four become no-ops at compile time anyway. put_mnt_ns() will never receive a NULL argument at any time. This unguarded usage is in line with other call sites of put_*_ns(). Signed-off-by: Joel Savitz <jsavitz@redhat.com> Link: https://lore.kernel.org/20250508184930.183040-2-jsavitz@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | f2fs: fix freezing filesystem during resize | Christian Brauner
Using FREEZE_HOLDER_USERSPACE has two consequences: (1) If userspace freezes the filesystem after mnt_drop_write_file() but before freeze_super() was called filesystem resizing will fail because the freeze isn't marked as nestable. (2) If the kernel has successfully frozen the filesystem via FREEZE_HOLDER_USERSPACE userspace can simply undo it by using the FITHAW ioctl. Fix both issues by using FREEZE_HOLDER_KERNEL. It will nest with FREEZE_HOLDER_USERSPACE and cannot be undone by userspace. And it is the correct thing to do because the kernel temporarily freezes the filesystem. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | Merge patch series "power: wire-up filesystem freeze/thaw with suspend/resume" | Christian Brauner
Christian Brauner <brauner@kernel.org> says: Now all the pieces are in place to actually allow the power subsystem to freeze/thaw filesystems during suspend/resume. Filesystems are only frozen and thawed if the power subsystem does actually own the freeze. Otherwise it risks thawing filesystems it didn't own. This could be done differently by, e.g., keeping the filesystems that were actually frozen on a list and then unfreezing them from that list. This is disgustingly unclean though and reeks of an ugly hack. If the filesystem is already frozen by the time we've frozen all userspace processes we don't care to freeze it again. That's userspace's job once the process resumes. We only actually freeze filesystems if we absolutely have to and we ignore other failures to freeze. We could bubble up errors and fail suspend/resume if the error isn't EBUSY (aka it's already frozen) but I don't think that this is worth it. Filesystem freezing during suspend/resume is best-effort. If the user has 500 ext4 filesystems mounted and 4 fail to freeze for whatever reason then we simply skip them. What we have now is already a big improvement and let's see how we fare with it before making our lives even harder (and uglier) than we have to. * patches from https://lore.kernel.org/r/20250402-work-freeze-v2-0-6719a97b52ac@kernel.org: kernfs: add warning about implementing freeze/thaw power: freeze filesystems during suspend/resume Link: https://lore.kernel.org/r/20250402-work-freeze-v2-0-6719a97b52ac@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | Merge patch series "efivarfs: support freeze/thaw" | Christian Brauner
Christian Brauner <brauner@kernel.org> says: Allow efivarfs to partake to resync variable state during system hibernation and suspend. Add freeze/thaw support. This is a pretty straightforward implementation. We simply add regular freeze/thaw support for both userspace and the kernel. This works without any big issues and congrats afaict efivars is the first pseudofilesystem that adds support for filesystem freezing and thawing. The simplicity comes from the fact that we simply always resync variable state after efivarfs has been frozen. It doesn't matter whether that's because of suspend, userspace initiated freeze or hibernation. Efivars is simple enough that it doesn't matter that we walk all dentries. There are no directories and there aren't insane amounts of entries and both freeze/thaw are already heavy-handed operations. If userspace initiated a freeze/thaw cycle they would need CAP_SYS_ADMIN in the initial user namespace (as that's where efivarfs is mounted) so it can't be triggered by random userspace. IOW, we really really don't care. * patches from https://lore.kernel.org/r/20250331-work-freeze-v1-0-6dfbe8253b9f@kernel.org: efivarfs: support freeze/thaw libfs: export find_next_child() Link: https://lore.kernel.org/r/20250331-work-freeze-v1-0-6dfbe8253b9f@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | kernfs: add warning about implementing freeze/thaw | Christian Brauner
Sysfs is built on top of kernfs and sysfs provides the power management infrastructure to support suspend/hibernate by writing to various files in /sys/power/. As filesystems may be automatically frozen during suspend/hibernate implementing freeze/thaw support for kernfs generically will cause deadlocks as the suspending/hibernation initiating task will hold a VFS lock that it will then wait upon to be released. If freeze/thaw for kernfs is needed talk to the VFS. Link: https://lore.kernel.org/r/20250402-work-freeze-v2-4-6719a97b52ac@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | efivarfs: support freeze/thaw | Christian Brauner
Allow efivarfs to partake to resync variable state during system hibernation and suspend. Add freeze/thaw support. This is a pretty straightforward implementation. We simply add regular freeze/thaw support for both userspace and the kernel. This works without any big issues and congrats afaict efivars is the first pseudofilesystem that adds support for filesystem freezing and thawing. The simplicity comes from the fact that we simply always resync variable state after efivarfs has been frozen. It doesn't matter whether that's because of suspend, userspace initiated freeze or hibernation. Efivars is simple enough that it doesn't matter that we walk all dentries. There are no directories and there aren't insane amounts of entries and both freeze/thaw are already heavy-handed operations. We really really don't need to care. Link: https://lore.kernel.org/r/20250331-work-freeze-v1-2-6dfbe8253b9f@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | power: freeze filesystems during suspend/resume | Christian Brauner
Now all the pieces are in place to actually allow the power subsystem to freeze/thaw filesystems during suspend/resume. Filesystems are only frozen and thawed if the power subsystem does actually own the freeze. We could bubble up errors and fail suspend/resume if the error isn't EBUSY (aka it's already frozen) but I don't think that this is worth it. Filesystem freezing during suspend/resume is best-effort. If the user has 500 ext4 filesystems mounted and 4 fail to freeze for whatever reason then we simply skip them. What we have now is already a big improvement and let's see how we fare with it before making our lives even harder (and uglier) than we have to. We add a new sysctl knob /sys/power/freeze_filesystems that will allow userspace to freeze filesystems during suspend/hibernate. For now it defaults to off. The thaw logic doesn't require checking whether freezing is enabled because the power subsystem exclusively owns frozen filesystems for the duration of suspend/hibernate and is able to skip filesystems it doesn't need to freeze. Also it is technically possible that filesystem_freeze_enabled is true and power freezes the filesystems but before freezing all processes another process disables filesystem_freeze_enabled. If power were to place the filesystems_thaw() call under filesystems_freeze_enabled it would fail to thaw the filesystems it froze. The exclusive holder mechanism makes it possible to iterate through the list without any concern making sure that no filesystems are left frozen. Link: https://lore.kernel.org/r/20250402-work-freeze-v2-3-6719a97b52ac@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | libfs: export find_next_child() | Christian Brauner
Export find_next_child() so it can be used by efivarfs. Keep it internal for now. There's no reason to advertise this kernel-wide. Link: https://lore.kernel.org/r/20250331-work-freeze-v1-1-6dfbe8253b9f@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | Merge patch series "Extend freeze support to suspend and hibernate" | Christian Brauner
Christian Brauner <brauner@kernel.org> says: Add the necessary infrastructure changes to support freezing for suspend and hibernate. This should be all that's needed to wire up power. * patches from https://lore.kernel.org/r/20250329-work-freeze-v2-0-a47af37ecc3d@kernel.org: super: add filesystem freezing helpers for suspend and hibernate gfs2: pass through holder from the VFS for freeze/thaw super: use common iterator (Part 2) super: use a common iterator (Part 1) super: skip dying superblocks early super: simplify user_get_super() super: remove pointless s_root checks Link: https://lore.kernel.org/r/20250329-work-freeze-v2-0-a47af37ecc3d@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | super: add filesystem freezing helpers for suspend and hibernate | Christian Brauner
Allow the power subsystem to support filesystem freeze for suspend and hibernate. For some kernel subsystems it is paramount that they are guaranteed that they are the owner of the freeze to avoid any risk of deadlocks. This is the case for the power subsystem. Enable it to recognize whether it did actually freeze the filesystem. If userspace has 10 filesystems and suspend/hibernate manages to freeze 5 and then fails on the 6th for whatever odd reason (current or future) then power needs to undo the freeze of the first 5 filesystems. It can't just walk the list again because while it's unlikely that a new filesystem got added in the meantime it still cannot tell which filesystems the power subsystem actually managed to get a freeze reference count on that needs to be dropped during thaw. There's various ways out of this ugliness. For example, record the filesystems the power subsystem managed to freeze on a temporary list in the callbacks and then walk that list backwards during thaw to undo the freezing or make sure that the power subsystem just actually exclusively freezes things it can freeze and marking such filesystems as being owned by power for the duration of the suspend or resume cycle. I opted for the latter as that seemed the clean thing to do even if it means more code changes. If hibernation races with filesystem freezing (e.g. DM reconfiguration), then hibernation need not freeze a filesystem because it's already frozen but userspace may thaw the filesystem before hibernation actually happens. If the race happens the other way around, DM reconfiguration may unexpectedly fail with EBUSY. So allow FREEZE_EXCL to nest with other holders. An exclusive freezer cannot be undone by any of the other concurrent freezers. Link: https://lore.kernel.org/r/20250329-work-freeze-v2-6-a47af37ecc3d@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
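The ownership approach chosen in this message (mark what power froze, then thaw only what carries that mark) can be pictured with a toy model. This is a sketch under loud assumptions: `ToyFs`, the `holders` set, the `"power"` and `"userspace"` labels, and both functions are hypothetical and do not correspond to the kernel's actual data structures or helper names.

```python
class ToyFs:
    """Toy filesystem that tracks who currently holds a freeze on it."""

    def __init__(self, name, can_freeze=True):
        self.name = name
        self.can_freeze = can_freeze
        self.holders = set()


def freeze_all_for_power(filesystems):
    """Best-effort freeze: mark ownership only where the freeze succeeds."""
    for fs in filesystems:
        if fs.can_freeze:
            # The power-owned freeze nests with any existing holder.
            fs.holders.add("power")


def thaw_all_for_power(filesystems):
    """Drop only the freezes that power itself owns; return their names."""
    thawed = []
    for fs in filesystems:
        if "power" in fs.holders:
            fs.holders.discard("power")
            thawed.append(fs.name)
    return thawed
```

Because ownership is recorded on each filesystem, the thaw pass can simply walk the whole list again: a filesystem that failed to freeze carries no mark and is skipped, and one that userspace froze independently keeps its own holder after power lets go.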
2025-05-09 | fs: use writeback_iter directly in mpage_writepages | Christoph Hellwig
Stop using write_cache_pages and use writeback_iter directly. This removes an indirect call per written folio and makes the code easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250507062124.3933305-1-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | Merge patch series "iomap: misc buffered write path cleanups and prep" | Christian Brauner
Brian Foster <bfoster@redhat.com> says: Here's a bit more fallout and prep. work associated with the folio batch prototype posted a while back [1]. Work on that is still pending so it isn't included here, but based on the iter advance cleanups most of these seemed worthwhile as standalone cleanups. Mainly this just cleans up some of the helpers and pushes some pos/len trimming further down in the write begin path. The fbatch thing is still in prototype stage, but for context the intent here is that it can mostly now just bolt onto the folio lookup path because we can advance the range that is skipped and return the next folio along with the folio subrange for the caller to process. [1] https://lore.kernel.org/linux-fsdevel/20241213150528.1003662-1-bfoster@redhat.com/ * patches from https://lore.kernel.org/20250506134118.911396-1-bfoster@redhat.com: iomap: rework iomap_write_begin() to return folio offset and length iomap: push non-large folio check into get folio path iomap: helper to trim pos/bytes to within folio iomap: drop pos param from __iomap_[get|put]_folio() iomap: drop unnecessary pos param from iomap_write_[begin|end] iomap: resample iter->pos after iomap_write_begin() calls Link: https://lore.kernel.org/20250506134118.911396-1-bfoster@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | iomap: rework iomap_write_begin() to return folio offset and length | Brian Foster
iomap_write_begin() returns a folio based on current pos and remaining length in the iter, and each caller then trims the pos/length to the given folio. Clean this up a bit and let iomap_write_begin() return the trimmed range along with the folio. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/20250506134118.911396-7-bfoster@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | iomap: push non-large folio check into get folio path | Brian Foster
The len param to __iomap_get_folio() is primarily a folio allocation hint. iomap_write_begin() already trims its local len variable based on the provided folio, so move the large folio support check closer to folio lookup. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/20250506134118.911396-6-bfoster@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | iomap: helper to trim pos/bytes to within folio | Brian Foster
Several buffered write based iteration callbacks duplicate logic to trim the current pos and length to within the current folio. Factor this into a helper to make it easier to relocate closer to folio lookup. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/20250506134118.911396-5-bfoster@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
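The trimming that such a helper performs is plain offset arithmetic. The sketch below is a hypothetical illustration, not the kernel helper: the function name, its parameters, and the choice to raise on an out-of-range position are assumptions made for the example.

```python
def trim_to_folio(pos, length, folio_pos, folio_size):
    """Clamp a write range to one folio.

    pos        -- file position where the write starts
    length     -- bytes remaining to write
    folio_pos  -- file position of the first byte of the folio
    folio_size -- size of the folio in bytes

    Returns (offset within the folio, bytes that fit in the folio).
    """
    offset = pos - folio_pos
    if not 0 <= offset < folio_size:
        raise ValueError("pos does not fall inside the folio")
    return offset, min(length, folio_size - offset)
```

For example, an 8192-byte write starting at position 6144 into a 4096-byte folio located at 4096 is trimmed to offset 2048 and length 2048; the remaining bytes belong to later folios.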
2025-05-09 | iomap: drop pos param from __iomap_[get|put]_folio() | Brian Foster
Both helpers take the iter and pos as parameters. All callers effectively pass iter->pos, so drop the unnecessary pos parameter. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/20250506134118.911396-4-bfoster@redhat.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | iomap: drop unnecessary pos param from iomap_write_[begin|end] | Brian Foster
iomap_write_begin() and iomap_write_end() both take the iter and iter->pos as parameters. Drop the unnecessary pos parameter and sample iter->pos within each function. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/20250506134118.911396-3-bfoster@redhat.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | iomap: resample iter->pos after iomap_write_begin() calls | Brian Foster
In preparation for removing the pos parameter, push the local pos assignment down after calls to iomap_write_begin(). Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/20250506134118.911396-2-bfoster@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | fs: Remove redundant errseq_set call in mark_buffer_write_io_error. | Jeremy Bongio
mark_buffer_write_io_error sets sb->s_wb_err to -EIO twice. Once in mapping_set_error and once in errseq_set. Only mapping_set_error checks if bh->b_assoc_map->host is NULL. Discovered during null pointer dereference during writeback to a failing device:
    [<ffffffff9a416dc8>] ? mark_buffer_write_io_error+0x98/0xc0
    [<ffffffff9a416dbe>] ? mark_buffer_write_io_error+0x8e/0xc0
    [<ffffffff9ad4bda0>] end_buffer_async_write+0x90/0xd0
    [<ffffffff9ad4e3eb>] end_bio_bh_io_sync+0x2b/0x40
    [<ffffffff9adbafe6>] blk_update_request+0x1b6/0x480
    [<ffffffff9adbb3d8>] blk_mq_end_request+0x18/0x30
    [<ffffffff9adbc6aa>] blk_mq_dispatch_rq_list+0x4da/0x8e0
    [<ffffffff9adc0a68>] __blk_mq_sched_dispatch_requests+0x218/0x6a0
    [<ffffffff9adc07fa>] blk_mq_sched_dispatch_requests+0x3a/0x80
    [<ffffffff9adbbb98>] blk_mq_run_hw_queue+0x108/0x330
    [<ffffffff9adbcf58>] blk_mq_flush_plug_list+0x178/0x5f0
    [<ffffffff9adb6741>] __blk_flush_plug+0x41/0x120
    [<ffffffff9adb6852>] blk_finish_plug+0x22/0x40
    [<ffffffff9ad47cb0>] wb_writeback+0x150/0x280
    [<ffffffff9ac5343f>] ? set_worker_desc+0x9f/0xc0
    [<ffffffff9ad4676e>] wb_workfn+0x24e/0x4a0
Fixes: 485e9605c0573 ("fs/buffer.c: record blockdev write errors in super_block that it backs") Signed-off-by: Jeremy Bongio <jbongio@google.com> Link: https://lore.kernel.org/20250507123010.1228243-1-jbongio@google.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09arm64: dts: imx8mp: use 800MHz NoC OPP for nominal drive modeAhmad Fatoum
When running in nominal drive mode, the maximum allowed frequency for the NoC is 800MHz, but the OPP table for the i.MX8MP interconnect device listed the 1GHz operating point for the NoC, regardless of the active mode. The newly introduced imx8mp-nominal.dtsi header reconfigures the clock controller to observe nominal drive mode limits, so have it modify the maximum NoC OPP as well. Fixes: 255fbd9eabe7 ("arm64: dts: imx8mp: Add optional nominal drive mode DTSI") Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2025-05-09ASoc: SOF: topology: connect DAI to a single DAI linkKai Vehmanen
The partial matching of DAI widget to link names can cause problems if one of the widget names is a substring of another. E.g. with names "Foo1" and "Foo10", it's not possible to correctly link up "Foo1". Modify the logic so that if multiple DAI links match the widget stream name, a full match is prioritized if one is found. Fixes: fe88788779fc ("ASoC: SOF: topology: Use partial match for connecting DAI link and DAI widget") Link: https://github.com/thesofproject/linux/issues/5308 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Reviewed-by: Péter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Link: https://patch.msgid.link/20250509085318.13936-1-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
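The matching rule described in this commit message can be sketched in a few lines. This is a hypothetical Python model, not the SOF topology code: pick_dai_link() and its arguments are illustrative names, and "partial match" is assumed here to mean that the widget stream name occurs as a substring of the link name.

```python
def pick_dai_link(widget_name, link_names):
    """Return the link to connect the widget to, or None if nothing matches.

    Hypothetical model: a link is a candidate when the widget stream name
    is a substring of the link name (assumed meaning of "partial match").
    A full match always wins; otherwise the first partial match is used.
    """
    first_partial = None
    for name in link_names:
        if widget_name not in name:
            continue  # not even a partial match
        if name == widget_name:
            return name  # full match takes priority over any partial match
        if first_partial is None:
            first_partial = name  # remember the first partial match as fallback
    return first_partial


if __name__ == "__main__":
    # "Foo1" is a substring of "Foo10", so both links are candidates;
    # the full match is chosen regardless of list order.
    print(pick_dai_link("Foo1", ["Foo10", "Foo1"]))
```

With only partial matching, the widget "Foo1" could end up attached to "Foo10" depending on list order; preferring the full match removes that ambiguity while still allowing a partial match when no exact name exists.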
2025-05-09ASoC: SOF: Intel: hda-bus: Use PIO mode on ACE2+ platformsPeter Ujfalusi
Keep using the PIO mode for commands on ACE2+ platforms, similarly to how the legacy stack is configured. Fixes: 05cf17f1bf6d ("ASoC: SOF: Intel: hda-bus: Use PIO mode for Lunar Lake") Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20250509081308.13784-1-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-05-09ASoC: SOF: ipc4-pcm: Delay reporting is only supported for playback directionPeter Ujfalusi
The firmware does not provide any information for capture streams via the shared pipeline registers. To avoid reporting an invalid delay value for capture streams to user space, delay reporting needs to be disabled for them. Fixes: af74dbd0dbcf ("ASoC: SOF: ipc4-pcm: allocate time info for pcm delay feature") Cc: stable@vger.kernel.org Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Liam Girdwood <liam.r.girdwood@intel.com> Link: https://patch.msgid.link/20250509085951.15696-1-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-05-09ASoC: SOF: ipc4-control: Use SOF_CTRL_CMD_BINARY as numid for bytes_extPeter Ujfalusi
The header.numid is set to scontrol->comp_id in bytes_ext_get and it is ignored during bytes_ext_put. Using comp_id is not ideal, as it is a kernel-internal identification number. Set the header.numid to SOF_CTRL_CMD_BINARY during get and validate the numid during put to provide a consistent and compatible identification number, as with IPC3. For IPC4, existing tooling also ignored the numid, but with the use of SOF_CTRL_CMD_BINARY the different handling of the blobs can be dropped, providing a better user experience. Reported-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com> Closes: https://github.com/thesofproject/linux/issues/5282 Fixes: a062c8899fed ("ASoC: SOF: ipc4-control: Add support for bytes control get and put") Cc: stable@vger.kernel.org Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Reviewed-by: Liam Girdwood <liam.r.girdwood@intel.com> Link: https://patch.msgid.link/20250509085633.14930-1-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-05-09Add RZ/G3E xSPI supportMark Brown
Merge series from Biju Das <biju.das.jz@bp.renesas.com>: The xSPI IP found on the RZ/G3E SoC is similar to the RPC-IF interface, but it can support writes to the memory-mapped area. Even though the registers are different, the rpcif driver code can be reused for xSPI by adding wrapper functions.
2025-05-09configfs: Correct error value returned by API config_item_set_name()Zijun Hu
A kvasprintf() failure is usually caused by a failed memory allocation, for which the error code is -ENOMEM, but config_item_set_name() returns -EFAULT in that case. Fix by returning -ENOMEM instead of -EFAULT for the failure. Reviewed-by: Joel Becker <jlbec@evilplan.org> Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Link: https://lore.kernel.org/r/20250507-fix_configfs-v3-3-fe2d96de8dc4@quicinc.com Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
2025-05-09configfs: Do not override creating attribute file failure in populate_attrs()Zijun Hu
populate_attrs() may overwrite a failure to create attribute files with the success of creating subsequent bin attribute files, and so return the wrong value. Fix by creating bin attribute files only if the attribute files were created successfully. Fixes: 03607ace807b ("configfs: implement binary attributes") Cc: stable@vger.kernel.org Reviewed-by: Joel Becker <jlbec@evilplan.org> Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Link: https://lore.kernel.org/r/20250507-fix_configfs-v3-2-fe2d96de8dc4@quicinc.com Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
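The error-overwrite pattern this commit describes is easy to reproduce in miniature. The sketch below is a hypothetical Python model of the control flow only, not the configfs code: populate_buggy(), populate_fixed() and the create callback are illustrative names, and errors are modelled as negative errno values as in the kernel.

```python
import errno


def populate_buggy(attrs, bin_attrs, create):
    """Model of the old flow: the second loop reuses 'error' unconditionally."""
    error = 0
    for attr in attrs:
        error = create(attr)
        if error:
            break
    for bin_attr in bin_attrs:
        error = create(bin_attr)  # overwrites a failure from the first loop
        if error:
            break
    return error


def populate_fixed(attrs, bin_attrs, create):
    """Model of the fixed flow: bin attributes are created only on success."""
    error = 0
    for attr in attrs:
        error = create(attr)
        if error:
            break
    if not error:
        for bin_attr in bin_attrs:
            error = create(bin_attr)
            if error:
                break
    return error


def fake_create(name):
    """Hypothetical creation callback: fails for the attribute named 'bad'."""
    return -errno.ENOMEM if name == "bad" else 0


if __name__ == "__main__":
    print(populate_buggy(["bad"], ["blob"], fake_create))
    print(populate_fixed(["bad"], ["blob"], fake_create))
```

In the buggy variant the failure from the first loop is lost as soon as one bin attribute is created successfully, so the caller sees success; the fixed variant keeps the original error and skips the second loop.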
2025-05-09configfs: Delete semicolon from macro type_print() definitionZijun Hu
The definition of the type_print() macro ends with a semicolon, which causes subsequent macro invocations to end with two semicolons. Fix by deleting the semicolon from the macro definition. Reviewed-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Link: https://lore.kernel.org/r/20250507-fix_configfs-v3-1-fe2d96de8dc4@quicinc.com Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
2025-05-09drm/i915/dp: Fix determining SST/MST mode during MTP TU state computationImre Deak
Determining the SST/MST mode during state computation must be done based on the output type stored in the CRTC state, which in turn is set once based on the modeset connector's SST vs. MST type and will not change as long as the connector is using the CRTC. OTOH the MST mode indicated by the given connector's intel_dp::is_mst flag can change independently of the above output type, based on what sink is at any moment plugged to the connector. Fix the state computation accordingly. Cc: Jani Nikula <jani.nikula@intel.com> Fixes: f6971d7427c2 ("drm/i915/mst: adapt intel_dp_mtp_tu_compute_config() for 128b/132b SST") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4607 Reviewed-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Imre Deak <imre.deak@intel.com> Link: https://lore.kernel.org/r/20250507151953.251846-1-imre.deak@intel.com (cherry picked from commit 0f45696ddb2b901fbf15cb8d2e89767be481d59f) Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-05-09Merge tag 'atomic-writes-6.16_2025-05-07' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into atomic_writesCarlos Maiolino
large atomic writes for xfs [v12.1] Currently atomic write support for xfs is limited to writing a single block as we have no way to guarantee alignment and that the write covers a single extent. This series introduces a method to issue atomic writes via a software-based method. The software-based method is used as a fallback for when attempting to issue an atomic write over misaligned or multiple extents. For xfs, this support is based on reflink CoW support. The basic idea of this CoW method is to alloc a range in the CoW fork, write the data, and atomically update the mapping. Initial mysql performance testing has shown this method to perform ok. However, there we are only using 16K atomic writes (and 4K block size), so typically - and thankfully - this software fallback method won't be used often. For other FSes which want large atomic writes and don't support CoW, I think that they can follow the example in [0]. Catherine is currently working on further xfstests for this feature, which we hope to share soon. About 17/17, maybe it can be omitted as there is no strong demand to have it included. Based on bfecc4091e07 (xfs/next-rc, xfs/for-next) xfs: allow ro mounts if rtdev or logdev are read-only [0] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/ Differences to v12: - add more review tags Differences to v11: - split "xfs: ignore ..." patch - inline sync_blockdev() in xfs_alloc_buftarg() (Christoph) - fix xfs_calc_rtgroup_awu_max() for 0 block count (Darrick) - Add RB tag from Christoph (thanks!) Differences to v10: - add "xfs: only call xfs_setsize_buftarg once ..." by Darrick - symbol renames in "xfs: ignore HW which cannot..." by Darrick Differences to v9: - rework "ignore HW which cannot .." patch by Darrick - Ensure power-of-2 max always for unit min/max when no HW support With a bit of luck, this should all go splendidly.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>