summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-02-24powerpc/xmon: simplify xmon_batch_next_cpu()Yury Norov
The function opencodes for_each_cpu_wrap() macro. As a loop termination condition it uses cpumask_empty(), which is O(N), and it makes the whole algorithm O(N^2). Switching to for_each_cpu_wrap() simplifies the logic, and makes the algorithm linear. Signed-off-by: Yury Norov <yury.norov@gmail.com>
2025-02-24ibmvnic: simplify ibmvnic_set_queue_affinity()Yury Norov
A loop based on cpumask_next_wrap() opencodes the dedicated macro for_each_online_cpu_wrap(). Use it as it improves readability and simplifies maintenance. This also helps to drop cpumask handling code in the caller function. Signed-off-by: Yury Norov <yury.norov@gmail.com> Tested-by: Nick Child <nnac123@linux.ibm.com>
2025-02-24virtio_net: simplify virtnet_set_affinity()Yury Norov
The inner loop may be replaced with the dedicated for_each_online_cpu_wrap. Use it as it improves readability and simplifies maintenance. Signed-off-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Nick Child <nnac123@linux.ibm.com> Acked-by: Michael S. Tsirkin <mst@redhat.com>
2025-02-24drm/xe/userptr: fix EFAULT handlingMatthew Auld
Currently we treat EFAULT from hmm_range_fault() as a non-fatal error when called from xe_vm_userptr_pin() with the idea that we want to avoid killing the entire vm and chucking an error, under the assumption that the user just did an unmap or something, and has no intention of actually touching that memory from the GPU. At this point we have already zapped the PTEs so any access should generate a page fault, and if the pin fails there also it will then become fatal. However it looks like it's possible for the userptr vma to still be on the rebind list in preempt_rebind_work_func(), if we had to retry the pin again due to something happening in the caller before we did the rebind step, but in the meantime needing to re-validate the userptr and this time hitting the EFAULT. This explains an internal user report of hitting: [ 191.738349] WARNING: CPU: 1 PID: 157 at drivers/gpu/drm/xe/xe_res_cursor.h:158 xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe] [ 191.738551] Workqueue: xe-ordered-wq preempt_rebind_work_func [xe] [ 191.738616] RIP: 0010:xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe] [ 191.738690] Call Trace: [ 191.738692] <TASK> [ 191.738694] ? show_regs+0x69/0x80 [ 191.738698] ? __warn+0x93/0x1a0 [ 191.738703] ? xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe] [ 191.738759] ? report_bug+0x18f/0x1a0 [ 191.738764] ? handle_bug+0x63/0xa0 [ 191.738767] ? exc_invalid_op+0x19/0x70 [ 191.738770] ? asm_exc_invalid_op+0x1b/0x20 [ 191.738777] ? xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe] [ 191.738834] ? ret_from_fork_asm+0x1a/0x30 [ 191.738849] bind_op_prepare+0x105/0x7b0 [xe] [ 191.738906] ? dma_resv_reserve_fences+0x301/0x380 [ 191.738912] xe_pt_update_ops_prepare+0x28c/0x4b0 [xe] [ 191.738966] ? kmemleak_alloc+0x4b/0x80 [ 191.738973] ops_execute+0x188/0x9d0 [xe] [ 191.739036] xe_vm_rebind+0x4ce/0x5a0 [xe] [ 191.739098] ? trace_hardirqs_on+0x4d/0x60 [ 191.739112] preempt_rebind_work_func+0x76f/0xd00 [xe] Followed by NPD, when running some workload, since the sg was never actually populated but the vma is still marked for rebind when it should be skipped for this special EFAULT case. This is confirmed to fix the user report. v2 (MattB): - Move earlier. v3 (MattB): - Update the commit message to make it clear that this indeed fixes the issue. Fixes: 521db22a1d70 ("drm/xe: Invalidate userptr VMA on page pin fault") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: <stable@vger.kernel.org> # v6.10+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250221143840.167150-5-matthew.auld@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 6b93cb98910c826c2e2004942f8b060311e43618) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-02-24drm/xe/userptr: restore invalidation list on errorMatthew Auld
On error restore anything still on the pin_list back to the invalidation list on error. For the actual pin, so long as the vma is tracked on either list it should get picked up on the next pin, however it looks possible for the vma to get nuked but still be present on this per vm pin_list leading to corruption. An alternative might be then to instead just remove the link when destroying the vma. v2: - Also add some asserts. - Keep the overzealous locking so that we are consistent with the docs; updating the docs and related bits will be done as a follow up. Fixes: ed2bdf3b264d ("drm/xe/vm: Subclass userptr vmas") Suggested-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250221143840.167150-4-matthew.auld@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 4e37e928928b730de9aa9a2f5dc853feeebc1742) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-02-24dma-mapping: update MAINTAINERSChristoph Hellwig
Marek has graciously offered to maintain the dma-mapping tree. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-02-24configfs: update MAINTAINERSChristoph Hellwig
Joel will go back to maintain configfs alone on a time permitting basis. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-02-24x86/percpu: Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N()Uros Bizjak
Unify __pcpu_op1_N() and __pcpu_op2_N() macros to __pcpu_op_N() by applying the macro only to asm mnemonic, not to the mnemonic plus its arguments. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250224071648.15913-1-ubizjak@gmail.com
2025-02-24uprobes: Reject the shared zeropage in uprobe_write_opcode()Tong Tiangen
We triggered the following crash in syzkaller tests: BUG: Bad page state in process syz.7.38 pfn:1eff3 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1eff3 flags: 0x3fffff00004004(referenced|reserved|node=0|zone=1|lastcpupid=0x1fffff) raw: 003fffff00004004 ffffe6c6c07bfcc8 ffffe6c6c07bfcc8 0000000000000000 raw: 0000000000000000 0000000000000000 00000000fffffffe 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x32/0x50 bad_page+0x69/0xf0 free_unref_page_prepare+0x401/0x500 free_unref_page+0x6d/0x1b0 uprobe_write_opcode+0x460/0x8e0 install_breakpoint.part.0+0x51/0x80 register_for_each_vma+0x1d9/0x2b0 __uprobe_register+0x245/0x300 bpf_uprobe_multi_link_attach+0x29b/0x4f0 link_create+0x1e2/0x280 __sys_bpf+0x75f/0xac0 __x64_sys_bpf+0x1a/0x30 do_syscall_64+0x56/0x100 entry_SYSCALL_64_after_hwframe+0x78/0xe2 BUG: Bad rss-counter state mm:00000000452453e0 type:MM_FILEPAGES val:-1 The following syzkaller test case can be used to reproduce: r2 = creat(&(0x7f0000000000)='./file0\x00', 0x8) write$nbd(r2, &(0x7f0000000580)=ANY=[], 0x10) r4 = openat(0xffffffffffffff9c, &(0x7f0000000040)='./file0\x00', 0x42, 0x0) mmap$IORING_OFF_SQ_RING(&(0x7f0000ffd000/0x3000)=nil, 0x3000, 0x0, 0x12, r4, 0x0) r5 = userfaultfd(0x80801) ioctl$UFFDIO_API(r5, 0xc018aa3f, &(0x7f0000000040)={0xaa, 0x20}) r6 = userfaultfd(0x80801) ioctl$UFFDIO_API(r6, 0xc018aa3f, &(0x7f0000000140)) ioctl$UFFDIO_REGISTER(r6, 0xc020aa00, &(0x7f0000000100)={{&(0x7f0000ffc000/0x4000)=nil, 0x4000}, 0x2}) ioctl$UFFDIO_ZEROPAGE(r5, 0xc020aa04, &(0x7f0000000000)={{&(0x7f0000ffd000/0x1000)=nil, 0x1000}}) r7 = bpf$PROG_LOAD(0x5, &(0x7f0000000140)={0x2, 0x3, &(0x7f0000000200)=ANY=[@ANYBLOB="1800000000120000000000000000000095"], &(0x7f0000000000)='GPL\x00', 0x7, 0x0, 0x0, 0x0, 0x0, '\x00', 0x0, @fallback=0x30, 0xffffffffffffffff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x10, 0x0, @void, @value}, 0x94) bpf$BPF_LINK_CREATE_XDP(0x1c, &(0x7f0000000040)={r7, 0x0, 0x30, 0x1e, @val=@uprobe_multi={&(0x7f0000000080)='./file0\x00', &(0x7f0000000100)=[0x2], 0x0, 0x0, 0x1}}, 0x40) The cause is that zero pfn is set to the PTE without increasing the RSS count in mfill_atomic_pte_zeropage() and the refcount of zero folio does not increase accordingly. Then, the operation on the same pfn is performed in uprobe_write_opcode()->__replace_page() to unconditional decrease the RSS count and old_folio's refcount. Therefore, two bugs are introduced: 1. The RSS count is incorrect, when process exit, the check_mm() report error "Bad rss-count". 2. The reserved folio (zero folio) is freed when folio->refcount is zero, then free_pages_prepare->free_page_is_bad() report error "Bad page state". There is more, the following warning could also theoretically be triggered: __replace_page() -> ... -> folio_remove_rmap_pte() -> VM_WARN_ON_FOLIO(is_zero_folio(folio), folio) Considering that uprobe hit on the zero folio is a very rare case, just reject zero old folio immediately after get_user_page_vma_remote(). [ mingo: Cleaned up the changelog ] Fixes: 7396fa818d62 ("uprobes/core: Make background page replacement logic account for rss_stat counters") Fixes: 2b1444983508 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints") Signed-off-by: Tong Tiangen <tongtiangen@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250224031149.1598949-1-tongtiangen@huawei.com
2025-02-24binfmt: Remove loader from linux_binprm structYonatan Goldschmidt
Commit 987f20a9dcce ("a.out: Remove the a.out implementation") removed the last in-tree user of the loader field, and as far as I can tell, it was the only one historically. Signed-off-by: Yonatan Goldschmidt <yon.goldschmidt@gmail.com> Link: https://lore.kernel.org/r/20250223223234.13764-1-yon.goldschmidt@gmail.com Signed-off-by: Kees Cook <kees@kernel.org>
2025-02-24seccomp: avoid the lock trip seccomp_filter_release in common caseMateusz Guzik
Vast majority of threads don't have any seccomp filters, all while the lock taken here is shared between all threads in given process and frequently used. Safety of the check relies on the following: - seccomp_filter_release is only legally called for PF_EXITING threads - SIGNAL_GROUP_EXIT is only ever set with the sighand lock held - PF_EXITING is only ever set with the sighand lock held *or* after SIGNAL_GROUP_EXIT is set *or* the process is single-threaded - seccomp_sync_threads holds the sighand lock and skips all threads if SIGNAL_GROUP_EXIT is set, PF_EXITING threads if not Resulting reduction of contention gives me a 5% boost in a microbenchmark spawning and killing threads within the same process. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250213170911.1140187-1-mjguzik@gmail.com Signed-off-by: Kees Cook <kees@kernel.org>
2025-02-24coredump: Only sort VMAs when core_sort_vma sysctl is setKees Cook
The sorting of VMAs by size in commit 7d442a33bfe8 ("binfmt_elf: Dump smaller VMAs first in ELF cores") breaks elfutils[1]. Instead, sort based on the setting of the new sysctl, core_sort_vma, which defaults to 0, no sorting. Reported-by: Michael Stapelberg <michael@stapelberg.ch> Closes: https://lore.kernel.org/all/20250218085407.61126-1-michael@stapelberg.de/ [1] Fixes: 7d442a33bfe8 ("binfmt_elf: Dump smaller VMAs first in ELF cores") Signed-off-by: Kees Cook <kees@kernel.org>
2025-02-24ASoC: dapm-graph: set fill colour of turned on nodesNicolas Frattaroli
Some tools like KGraphViewer interpret the "ON" nodes not having an explicitly set fill colour as them being entirely black, which obscures the text on them and looks funny. In fact, I thought they were off for the longest time. Comparing to the output of the `dot` tool, I assume they are supposed to be white. Instead of speclawyering over who's in the wrong and must immediately atone for their wickedness at the altar of RFC2119, just be explicit about it, set the fillcolor to white, and nobody gets confused. Signed-off-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com> Tested-by: Luca Ceresoli <luca.ceresoli@bootlin.com> Reviewed-by: Luca Ceresoli <luca.ceresoli@bootlin.com> Link: https://patch.msgid.link/20250221-dapm-graph-node-colour-v1-1-514ed0aa7069@collabora.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-02-24ASoC: fsl: Rename stream name of SAI DAI driverChancel Liu
If stream names of DAI driver are duplicated there'll be warnings when machine driver tries to add widgets on a route: [ 8.831335] fsl-asoc-card sound-wm8960: ASoC: sink widget CPU-Playback overwritten [ 8.839917] fsl-asoc-card sound-wm8960: ASoC: source widget CPU-Capture overwritten Use different stream names to avoid such warnings. DAI names in AUDMIX are also updated accordingly. Fixes: 15c958390460 ("ASoC: fsl_sai: Add separate DAI for transmitter and receiver") Signed-off-by: Chancel Liu <chancel.liu@nxp.com> Acked-by: Shengjiu Wang <shengjiu.wang@gmail.com> Link: https://patch.msgid.link/20250217010437.258621-1-chancel.liu@nxp.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-02-24perf/core: Order the PMU list to fix warning about unordered pmu_ctx_listLuo Gengkun
Syskaller triggers a warning due to prev_epc->pmu != next_epc->pmu in perf_event_swap_task_ctx_data(). vmcore shows that two lists have the same perf_event_pmu_context, but not in the same order. The problem is that the order of pmu_ctx_list for the parent is impacted by the time when an event/PMU is added. While the order for a child is impacted by the event order in the pinned_groups and flexible_groups. So the order of pmu_ctx_list in the parent and child may be different. To fix this problem, insert the perf_event_pmu_context to its proper place after iteration of the pmu_ctx_list. The follow testcase can trigger above warning: # perf record -e cycles --call-graph lbr -- taskset -c 3 ./a.out & # perf stat -e cpu-clock,cs -p xxx // xxx is the pid of a.out test.c void main() { int count = 0; pid_t pid; printf("%d running\n", getpid()); sleep(30); printf("running\n"); pid = fork(); if (pid == -1) { printf("fork error\n"); return; } if (pid == 0) { while (1) { count++; } } else { while (1) { count++; } } } The testcase first opens an LBR event, so it will allocate task_ctx_data, and then open tracepoint and software events, so the parent context will have 3 different perf_event_pmu_contexts. On inheritance, child ctx will insert the perf_event_pmu_context in another order and the warning will trigger. [ mingo: Tidied up the changelog. ] Fixes: bd2756811766 ("perf: Rewrite core context handling") Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250122073356.1824736-1-luogengkun@huaweicloud.com
2025-02-24Merge tag 'kvm-riscv-fixes-6.14-1' of https://github.com/kvm-riscv/linux ↵Paolo Bonzini
into HEAD KVM/riscv fixes for 6.14, take #1 - Fix hart status check in SBI HSM extension - Fix hart suspend_type usage in SBI HSM extension - Fix error returned by SBI IPI and TIME extensions for unsupported function IDs - Fix suspend_type usage in SBI SUSP extension - Remove unnecessary vcpu kick after injecting interrupt via IMSIC guest file
2025-02-24Merge tag 'kvmarm-fixes-6.14-3' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.14, take #3 - Fix TCR_EL2 configuration to not use the ASID in TTBR1_EL2 and not mess-up T1SZ/PS by using the HCR_EL2.E2H==0 layout. - Bring back the VMID allocation to the vcpu_load phase, ensuring that we only setup VTTBR_EL2 once on VHE. This cures an ugly race that would lead to running with an unallocated VMID.
2025-02-24perf/core: Add RCU read lock protection to perf_iterate_ctx()Breno Leitao
The perf_iterate_ctx() function performs RCU list traversal but currently lacks RCU read lock protection. This causes lockdep warnings when running perf probe with unshare(1) under CONFIG_PROVE_RCU_LIST=y: WARNING: suspicious RCU usage kernel/events/core.c:8168 RCU-list traversed in non-reader section!! Call Trace: lockdep_rcu_suspicious ? perf_event_addr_filters_apply perf_iterate_ctx perf_event_exec begin_new_exec ? load_elf_phdrs load_elf_binary ? lock_acquire ? find_held_lock ? bprm_execve bprm_execve do_execveat_common.isra.0 __x64_sys_execve do_syscall_64 entry_SYSCALL_64_after_hwframe This protection was previously present but was removed in commit bd2756811766 ("perf: Rewrite core context handling"). Add back the necessary rcu_read_lock()/rcu_read_unlock() pair around perf_iterate_ctx() call in perf_event_exec(). [ mingo: Use scoped_guard() as suggested by Peter ] Fixes: bd2756811766 ("perf: Rewrite core context handling") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250117-fix_perf_rcu-v1-1-13cb9210fc6a@debian.org
2025-02-24rust: workqueue: define built-in bh queuesHamza Mahfooz
Provide safe getters to the system bh work queues. They will be used to reimplement the Hyper-V VMBus in rust. Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-24sched_ext: idle: Introduce scx_bpf_nr_node_ids()Andrea Righi
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the system. BPF schedulers can use this information together with the new node-aware kfuncs, for example to create per-node DSQs, validate node IDs, etc. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-24ASoC: es8328: fix route from DAC to outputNicolas Frattaroli
The ES8328 codec driver, which is also used for the ES8388 chip that appears to have an identical register map, claims that the output can either take the route from DAC->Mixer->Output or through DAC->Output directly. To the best of what I could find, this is not true, and creates problems. Without DACCONTROL17 bit index 7 set for the left channel, as well as DACCONTROL20 bit index 7 set for the right channel, I cannot get any analog audio out on Left Out 2 and Right Out 2 respectively, despite the DAPM routes claiming that this should be possible. Furthermore, the same is the case for Left Out 1 and Right Out 1, showing that those two don't have a direct route from DAC to output bypassing the mixer either. Those control bits toggle whether the DACs are fed (stale bread?) into their respective mixers. If one "unmutes" the mixer controls in alsamixer, then sure, the audio output works, but if it doesn't work without the mixer being fed the DAC input then evidently it's not a direct output from the DAC. ES8328/ES8388 are seemingly not alone in this. ES8323, which uses a separate driver for what appears to be a very similar register map, simply flips those two bits on in its probe function, and then pretends there is no power management whatsoever for the individual controls. Fair enough. My theory as to why nobody has noticed this up to this point is that everyone just assumes it's their fault when they had to unmute an additional control in ALSA. Fix this in the es8328 driver by removing the erroneous direct route, then get rid of the playback switch controls and have those bits tied to the mixer's widget instead, which until now had no register to play with. Fixes: 567e4f98922c ("ASoC: add es8328 codec driver") Signed-off-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com> Link: https://patch.msgid.link/20250222-es8328-route-bludgeoning-v1-1-99bfb7fb22d9@collabora.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-02-24dm vdo: add missing spin_lock_initKen Raeburn
Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org
2025-02-24Merge patch series "enable bs > ps for block devices"Christian Brauner
Luis Chamberlain <mcgrof@kernel.org> says: This v3 series addresses the feedback from the v2 series [0]. The only patch which was mofified was the patch titled "fs/mpage: use blocks_per_folio instead of blocks_per_page". The motivation for this series is to mainly start supporting block devices with logical block sizes larger than 4k, we do this by addressing buffer-head support required for the block device cache. In the future these changes can be leveraged to also start experimenting with LBS support for filesystems which support only buffer-heads. This paves the way for that work. Its perhaps is surprising to some but since this also lifts the block device cache sector size support to 64k, devices which support up to 64k sector sizes can also leverage this to enable filesystems created with larger sector sizes up to 64k sector sizes. The filesystem sector size is used or documented in a bit of obscurity except for few filesystems, but in short it ensures that the filesystem itself will not generate writes iteslef smaller than the specified sector size. In practice this means you can constrain metadata writes as well to a minimum size, and so be completely deterministic with regards to the specified sector size for min IO writes. For example since XFS can supports up to 32k sector size, it means with these changes enable filesystems to also be created on x86_64 with both the filesystem block size and sector size to 32k, now that the block device cache limitation is lifted. Since this touches buffer-heads I've ran this through fstests on ext4 and found no new regressions. I've also used blktests against a kernel built with these changes to test block devices with different larger logical block sizes than 4k on x86_64. All changes to be able to test block devices with a logical block size support > 4k are now merged on upstream blktests. I've tested the block layer with blktests with block devices with logical block sizes up to 64k which is the max we are currently supporting and found no new regressions. Detailed changes in this series: - Modifies the commit log for "fs/buffer: remove batching from async read" as per Willy's request and collects his SOB. - Collects Reviewed-by tags - The patch titled "fs/mpage: use blocks_per_folio instead of blocks_per_page" received more love to account for Willy's point that we should keep accounting in order for nr_pages on mpage. This does this by using folio_nr_pages() on the args passed and adjusts the last_block accounting accordingly. - Through code inspection fixed folio_zero_segment() use to use folio_size() as we move to suppor large folios for unmapped folio segments on do_mpage_readpage(), this is dealt with on the patch titled "fs/mpage: use blocks_per_folio instead of blocks_per_page" as that's when we start accounting large folios into the picture. [0] https://lkml.kernel.org/r/20250204231209.429356-1-mcgrof@kernel.org * patches from https://lore.kernel.org/r/20250221223823.1680616-1-mcgrof@kernel.org: bdev: use bdev_io_min() for statx block size block/bdev: lift block size restrictions to 64k block/bdev: enable large folio support for large logical block sizes fs/buffer fs/mpage: remove large folio restriction fs/mpage: use blocks_per_folio instead of blocks_per_page fs/mpage: avoid negative shift for large blocksize fs/buffer: remove batching from async read fs/buffer: simplify block_read_full_folio() with bh_offset() Link: https://lore.kernel.org/r/20250221223823.1680616-1-mcgrof@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24bdev: use bdev_io_min() for statx block sizeLuis Chamberlain
You can use lsblk to query for a block device block device block size: lsblk -o MIN-IO /dev/nvme0n1 MIN-IO 4096 The min-io is the minimum IO the block device prefers for optimal performance. In turn we map this to the block device block size. The current block size exposed even for block devices with an LBA format of 16k is 4k. Likewise devices which support 4k LBA format but have a larger Indirection Unit of 16k have an exposed block size of 4k. This incurs read-modify-writes on direct IO against devices with a min-io larger than the page size. To fix this, use the block device min io, which is the minimal optimal IO the device prefers. With this we now get: lsblk -o MIN-IO /dev/nvme0n1 MIN-IO 16384 And so userspace gets the appropriate information it needs for optimal performance. This is verified with blkalgn against mkfs against a device with LBA format of 4k but an NPWG of 16k (min io size) mkfs.xfs -f -b size=16k /dev/nvme3n1 blkalgn -d nvme3n1 --ops Write Block size : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 0 | | 2048 -> 4095 : 0 | | 4096 -> 8191 : 0 | | 8192 -> 16383 : 0 | | 16384 -> 32767 : 66 |****************************************| 32768 -> 65535 : 0 | | 65536 -> 131071 : 0 | | 131072 -> 262143 : 2 |* | Block size: 14 - 66 Block size: 17 - 2 Algn size : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 0 | | 2048 -> 4095 : 0 | | 4096 -> 8191 : 0 | | 8192 -> 16383 : 0 | | 16384 -> 32767 : 66 |****************************************| 32768 -> 65535 : 0 | | 65536 -> 131071 : 0 | | 131072 -> 262143 : 2 |* | Algn size: 14 - 66 Algn size: 17 - 2 Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250221223823.1680616-9-mcgrof@kernel.org Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24block/bdev: lift block size restrictions to 64kLuis Chamberlain
We now can support blocksizes larger than PAGE_SIZE, so in theory we should be able to lift the restriction up to the max supported page cache order. However bound ourselves to what we can currently validate and test. Through blktests and fstest we can validate up to 64k today. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250221223823.1680616-8-mcgrof@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24block/bdev: enable large folio support for large logical block sizesHannes Reinecke
Call mapping_set_folio_min_order() when modifying the logical block size to ensure folios are allocated with the correct size. Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250221223823.1680616-7-mcgrof@kernel.org Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24fs/buffer fs/mpage: remove large folio restrictionLuis Chamberlain
Now that buffer-heads has been converted over to support large folios we can remove the built-in VM_BUG_ON_FOLIO() checks which prevents their use. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250221223823.1680616-6-mcgrof@kernel.org Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24fs/mpage: use blocks_per_folio instead of blocks_per_pageLuis Chamberlain
Convert mpage to folios and adjust accounting for the number of blocks within a folio instead of a single page. This also adjusts the number of pages we should process to be the size of the folio to ensure we always read a full folio. Note that the page cache code already ensures do_mpage_readpage() will work with folios respecting the address space min order, this ensures that so long as folio_size() is used for our requirements mpage will also now be able to process block sizes larger than the page size. Originally-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250221223823.1680616-5-mcgrof@kernel.org Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24fs/mpage: avoid negative shift for large blocksizeHannes Reinecke
For large blocksizes the number of block bits is larger than PAGE_SHIFT, so calculate the sector number from the byte offset instead. This is required to enable large folios with buffer-heads. Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Hannes Reinecke <hare@kernel.org> Link: https://lore.kernel.org/r/20250221223823.1680616-4-mcgrof@kernel.org Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24fs/buffer: remove batching from async readMatthew Wilcox
block_read_full_folio() currently puts all !uptodate buffers into an array allocated on the stack, then iterates over it twice, first locking the buffers and then submitting them for read. We want to remove this array because it occupies too much stack space on configurations with a larger PAGE_SIZE (eg 512 bytes with 8 byte pointers and a 64KiB PAGE_SIZE). We cannot simply submit buffer heads as we find them as the completion handler needs to be able to tell when all reads are finished, so it can end the folio read. So we keep one buffer in reserve (using the 'prev' variable) until the end of the function. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250221223823.1680616-3-mcgrof@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24fs/buffer: simplify block_read_full_folio() with bh_offset()Luis Chamberlain
When we read over all buffers in a folio we currently use the buffer index on the folio and blocksize to get the offset. Simplify this with bh_offset(). This simplifies the loop while making no functional changes. Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250221223823.1680616-2-mcgrof@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24fs: Turn page_offset() into a wrapper around folio_pos()Matthew Wilcox (Oracle)
This is far less efficient for the lagging filesystems which still use page_offset(), but it removes an access to page->index. It also fixes a bug -- if any filesystem passed a tail page to page_offset(), it would return garbage which might result in the filesystem choosing to not writeback a dirty page. There probably aren't any examples of this, but I can't be certain. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250221203932.3588740-1-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24nsfs: remove d_op->d_deleteChristian Brauner
Nsfs only deals with unhashed dentries and there's currently no way for them to become hashed. So remove d_op->d_delete. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24pidfs: remove d_op->d_deleteChristian Brauner
Pidfs only deals with unhashed dentries and there's currently no way for them to become hashed. So remove d_op->d_delete. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24net: dsa: rtl8366rb: Fix compilation problemLinus Walleij
When the kernel is compiled without LED framework support the rtl8366rb fails to build like this: rtl8366rb.o: in function `rtl8366rb_setup_led': rtl8366rb.c:953:(.text.unlikely.rtl8366rb_setup_led+0xe8): undefined reference to `led_init_default_state_get' rtl8366rb.c:980:(.text.unlikely.rtl8366rb_setup_led+0x240): undefined reference to `devm_led_classdev_register_ext' As this is constantly coming up in different randconfig builds, bite the bullet and create a separate file for the offending code, split out a header with all stuff needed both in the core driver and the leds code. Add a new bool Kconfig option for the LED compile target, such that it depends on LEDS_CLASS=y || LEDS_CLASS=RTL8366RB which make LED support always available when LEDS_CLASS is compiled into the kernel and enforce that if the LEDS_CLASS is a module, then the RTL8366RB driver needs to be a module as well so that modprobe can resolve the dependencies. Fixes: 32d617005475 ("net: dsa: realtek: add LED drivers for rtl8366rb") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202502070525.xMUImayb-lkp@intel.com/ Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-02-23bcachefs: fix bch2_extent_ptr_eq()Kent Overstreet
Reviewed-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-23Linux 6.14-rc4v6.14-rc4Linus Torvalds
2025-02-23Merge tag 'i2c-for-6.14-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fix from Wolfram Sang: "Revert one cleanup which turned out to eat too much stack space" * tag 'i2c-for-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: core: Allocate temporary client dynamically
2025-02-23x86/ioperm: Use atomic64_inc_return() in ksys_ioperm()Uros Bizjak
Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref) to use optimized implementation on targets that define atomic_inc_return() and to remove now unneeded initialization of the %eax/%edx register pair before the call to atomic64_inc_return(). On x86_32 the code improves from: 1b0: b9 00 00 00 00 mov $0x0,%ecx 1b1: R_386_32 .bss 1b5: 89 43 0c mov %eax,0xc(%ebx) 1b8: 31 d2 xor %edx,%edx 1ba: b8 01 00 00 00 mov $0x1,%eax 1bf: e8 fc ff ff ff call 1c0 <ksys_ioperm+0xa8> 1c0: R_386_PC32 atomic64_add_return_cx8 1c4: 89 03 mov %eax,(%ebx) 1c6: 89 53 04 mov %edx,0x4(%ebx) to: 1b0: be 00 00 00 00 mov $0x0,%esi 1b1: R_386_32 .bss 1b5: 89 43 0c mov %eax,0xc(%ebx) 1b8: e8 fc ff ff ff call 1b9 <ksys_ioperm+0xa1> 1b9: R_386_PC32 atomic64_inc_return_cx8 1bd: 89 03 mov %eax,(%ebx) 1bf: 89 53 04 mov %edx,0x4(%ebx) Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250223161355.3607-1-ubizjak@gmail.com
2025-02-23Merge tag 'edac_urgent_for_v6.14_rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC fix from Borislav Petkov: - Have qcom_edac use the correct interrupt enable register to configure the RAS interrupt lines * tag 'edac_urgent_for_v6.14_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/qcom: Correct interrupt enable register configuration
2025-02-23efivarfs: Defer PM notifier registration until .fill_superArd Biesheuvel
syzbot reports an issue that turns out to be caused by the fact that the efivarfs PM notifier may be invoked before the efivarfs_fs_info::sb field is populated, resulting in a NULL deference. So defer the registration until efivarfs_fill_super() is invoked. Reported-by: syzbot+00d13e505ef530a45100@syzkaller.appspotmail.com Tested-by: syzbot+00d13e505ef530a45100@syzkaller.appspotmail.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-02-23efi/cper: Fix cper_arm_ctx_info alignmentPatrick Rudolph
According to the UEFI Common Platform Error Record appendix, the processor context information structure is a variable length structure, but "is padded with zeros if the size is not a multiple of 16 bytes". Currently this isn't honoured, causing all but the first structure to be garbage when printed. Thus align the size to be a multiple of 16. Signed-off-by: Patrick Rudolph <patrick.rudolph@9elements.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-02-23efi/cper: Fix cper_ia_proc_ctx alignmentPatrick Rudolph
According to the UEFI Common Platform Error Record appendix, the IA32/X64 Processor Context Information Structure is a variable length structure, but "is padded with zeros if the size is not a multiple of 16 bytes". Currently this isn't honoured, causing all but the first structure to be garbage when printed. Thus align the size to be a multiple of 16. Signed-off-by: Patrick Rudolph <patrick.rudolph@9elements.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-02-23RDMA/bnxt_re: Fix the page details for the srq created by kernel consumersKashyap Desai
While using nvme target with use_srq on, below kernel panic is noticed. [ 549.698111] bnxt_en 0000:41:00.0 enp65s0np0: FEC autoneg off encoding: Clause 91 RS(544,514) [ 566.393619] Oops: divide error: 0000 [#1] PREEMPT SMP NOPTI .. [ 566.393799] <TASK> [ 566.393807] ? __die_body+0x1a/0x60 [ 566.393823] ? die+0x38/0x60 [ 566.393835] ? do_trap+0xe4/0x110 [ 566.393847] ? bnxt_qplib_alloc_init_hwq+0x1d4/0x580 [bnxt_re] [ 566.393867] ? bnxt_qplib_alloc_init_hwq+0x1d4/0x580 [bnxt_re] [ 566.393881] ? do_error_trap+0x7c/0x120 [ 566.393890] ? bnxt_qplib_alloc_init_hwq+0x1d4/0x580 [bnxt_re] [ 566.393911] ? exc_divide_error+0x34/0x50 [ 566.393923] ? bnxt_qplib_alloc_init_hwq+0x1d4/0x580 [bnxt_re] [ 566.393939] ? asm_exc_divide_error+0x16/0x20 [ 566.393966] ? bnxt_qplib_alloc_init_hwq+0x1d4/0x580 [bnxt_re] [ 566.393997] bnxt_qplib_create_srq+0xc9/0x340 [bnxt_re] [ 566.394040] bnxt_re_create_srq+0x335/0x3b0 [bnxt_re] [ 566.394057] ? srso_return_thunk+0x5/0x5f [ 566.394068] ? __init_swait_queue_head+0x4a/0x60 [ 566.394090] ib_create_srq_user+0xa7/0x150 [ib_core] [ 566.394147] nvmet_rdma_queue_connect+0x7d0/0xbe0 [nvmet_rdma] [ 566.394174] ? lock_release+0x22c/0x3f0 [ 566.394187] ? srso_return_thunk+0x5/0x5f Page size and shift info is set only for the user space SRQs. Set page size and page shift for kernel space SRQs also. Fixes: 0c4dcd602817 ("RDMA/bnxt_re: Refactor hardware queue memory allocation") Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1740237621-29291-1-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-23x86/usercopy: Fix kernel-doc func param name in clean_cache_range()'s ↵Randy Dunlap
description Use @addr instead of @vaddr in the kernel-doc comment for clean_cache_range() to eliminate warnings: arch/x86/lib/usercopy_64.c:29: warning: Function parameter or struct member 'addr' not described in 'clean_cache_range' arch/x86/lib/usercopy_64.c:29: warning: Excess function parameter 'vaddr' description in 'clean_cache_range' Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250111063333.911084-1-rdunlap@infradead.org
2025-02-23RDMA/mlx5: Fix bind QP error cleanup flowPatrisious Haddad
When there is a failure during bind QP, the cleanup flow destroys the counter regardless if it is the one that created it or not, which is problematic since if it isn't the one that created it, that counter could still be in use. Fix that by destroying the counter only if it was created during this call. Fixes: 45842fc627c7 ("IB/mlx5: Support statistic q counter configuration") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Zhang <markzhang@nvidia.com> Link: https://patch.msgid.link/25dfefddb0ebefa668c32e06a94d84e3216257cf.1740033937.git.leon@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-23arm64: dts: freescale: imx8mm-verdin-dahlia: add Microphone Jack to sound cardStefan Eichenberger
The simple-audio-card's microphone widget currently connects to the headphone jack. Routing the microphone input to the microphone jack allows for independent operation of the microphone and headphones. This resolves the following boot-time kernel log message, which indicated a conflict when the microphone and headphone functions were not separated: debugfs: File 'Headphone Jack' in directory 'dapm' already present! Fixes: 6a57f224f734 ("arm64: dts: freescale: add initial support for verdin imx8m mini") Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com> Reviewed-by: Francesco Dolcini <francesco.dolcini@toradex.com> Cc: <stable@vger.kernel.org> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2025-02-23arm64: dts: freescale: imx8mp-verdin-dahlia: add Microphone Jack to sound cardStefan Eichenberger
The simple-audio-card's microphone widget currently connects to the headphone jack. Routing the microphone input to the microphone jack allows for independent operation of the microphone and headphones. This resolves the following boot-time kernel log message, which indicated a conflict when the microphone and headphone functions were not separated: debugfs: File 'Headphone Jack' in directory 'dapm' already present! Fixes: 874958916844 ("arm64: dts: freescale: verdin-imx8mp: dahlia: add sound card") Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com> Reviewed-by: Francesco Dolcini <francesco.dolcini@toradex.com> Cc: <stable@vger.kernel.org> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2025-02-23soc: imx8m: Unregister cpufreq and soc dev in cleanup pathPeng Fan
Unregister the cpufreq device and soc device when resource unwinding, otherwise there will be warning when do removing test: sysfs: cannot create duplicate filename '/devices/platform/imx-cpufreq-dt' CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.13.0-rc1-next-20241204 Hardware name: NXP i.MX8MPlus EVK board (DT) Fixes: 9cc832d37799 ("soc: imx8m: Probe the SoC driver as platform driver") Cc: Marco Felsch <m.felsch@pengutronix.de> Signed-off-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Marco Felsch <m.felsch@pengutronix.de> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2025-02-22Merge tag 'v6.14-rc3-smb3-client-fix-part2' of ↵Linus Torvalds
git://git.samba.org/sfrench/cifs-2.6 Pull smb client fix from Steve French: - Fix potential null pointer dereference * tag 'v6.14-rc3-smb3-client-fix-part2' of git://git.samba.org/sfrench/cifs-2.6: smb: client: Add check for next_buffer in receive_encrypted_standard()