summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-07-14sched: Fix proxy/current (push,pull)abilityValentin Schneider
Proxy execution forms atomic pairs of tasks: The waiting donor task (scheduling context) and a proxy (execution context). The donor task, along with the rest of the blocked chain, follows the proxy wrt CPU placement. They can be the same task, in which case push/pull doesn't need any modification. When they are different, however, FIFO1 & FIFO42: ,-> RT42 | | blocked-on | v blocked_donor | mutex | | owner | v `-- RT1 RT1 RT42 CPU0 CPU1 ^ ^ | | overloaded !overloaded rq prio = 42 rq prio = 0 RT1 is eligible to be pushed to CPU1, but should that happen it will "carry" RT42 along. Clearly here neither RT1 nor RT42 must be seen as push/pullable. Unfortunately, only the donor task is usually dequeued from the rq, and the proxy'ed execution context (rq->curr) remains on the rq. This can cause RT1 to be selected for migration from logic like the rt pushable_list. Thus, adda a dequeue/enqueue cycle on the proxy task before __schedule returns, which allows the sched class logic to avoid adding the now current task to the pushable_list. Furthermore, tasks becoming blocked on a mutex don't need an explicit dequeue/enqueue cycle to be made (push/pull)able: they have to be running to block on a mutex, thus they will eventually hit put_prev_task(). Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-8-jstultz@google.com
2025-07-14sched: Add an initial sketch of the find_proxy_task() functionJohn Stultz
Add a find_proxy_task() function which doesn't do much. When we select a blocked task to run, we will just deactivate it and pick again. The exception being if it has become unblocked after find_proxy_task() was called. This allows us to validate keeping blocked tasks on the runqueue and later deactivating them is working ok, stressing the failure cases for when a proxy isn't found. Greatly simplified from patch by: Peter Zijlstra (Intel) <peterz@infradead.org> Juri Lelli <juri.lelli@redhat.com> Valentin Schneider <valentin.schneider@arm.com> Connor O'Brien <connoro@google.com> [jstultz: Split out from larger proxy patch and simplified for review and testing.] Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-7-jstultz@google.com
2025-07-14sched: Fix runtime accounting w/ split exec & sched contextsJohn Stultz
Without proxy-exec, we normally charge the "current" task for both its vruntime as well as its sum_exec_runtime. With proxy, however, we have two "current" contexts: the scheduler context and the execution context. We want to charge the execution context rq->curr (ie: proxy/lock holder) execution time to its sum_exec_runtime (so it's clear to userland the rq->curr task *is* running), as well as its thread group. However the rest of the time accounting (such a vruntime and cgroup accounting), we charge against the scheduler context (rq->donor) task, because it is from that task that the time is being "donated". If the donor and curr tasks are the same, then it's the same as without proxy. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-6-jstultz@google.com
2025-07-14sched: Move update_curr_task logic into update_curr_seJohn Stultz
Absorb update_curr_task() into update_curr_se(), and in the process simplify update_curr_common(). This will make the next step a bit easier. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-5-jstultz@google.com
2025-07-14locking/mutex: Add p->blocked_on wrappers for correctness checksValentin Schneider
This lets us assert mutex::wait_lock is held whenever we access p->blocked_on, as well as warn us for unexpected state changes. [fix conflicts, call in more places] [jstultz: tweaked commit subject, reworked a good bit] Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-4-jstultz@google.com
2025-07-14locking/mutex: Rework task_struct::blocked_onPeter Zijlstra
Track the blocked-on relation for mutexes, to allow following this relation at schedule time. task | blocked-on v mutex | owner v task This all will be used for tracking blocked-task/mutex chains with the prox-execution patch in a similar fashion to how priority inheritance is done with rt_mutexes. For serialization, blocked-on is only set by the task itself (current). And both when setting or clearing (potentially by others), is done while holding the mutex::wait_lock. [minor changes while rebasing] [jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-3-jstultz@google.com
2025-07-14sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disableJohn Stultz
Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument sched_proxy_exec= that can be used to disable the feature at boot time if CONFIG_SCHED_PROXY_EXEC was enabled. Also uses this option to allow the rq->donor to be different from rq->curr. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-2-jstultz@google.com
2025-07-14Merge branch 'tip/sched/urgent'Peter Zijlstra
Avoid merge conflicts Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2025-07-14sched/topology: Remove sched_domain_topology_level::flagsK Prateek Nayak
Support for overlapping domains added in commit e3589f6c81e4 ("sched: Allow for overlapping sched_domain spans") also allowed forcefully setting SD_OVERLAP for !NUMA domains via FORCE_SD_OVERLAP sched_feat(). Since NUMA domains had to be presumed overlapping to ensure correct behavior, "sched_domain_topology_level::flags" was introduced. NUMA domains added the SDTL_OVERLAP flag would ensure SD_OVERLAP was always added during build_sched_domains() for these domains, even when FORCE_SD_OVERLAP was off. Condition for adding the SD_OVERLAP flag at the aforementioned commit was as follows: if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP)) sd->flags |= SD_OVERLAP; The FORCE_SD_OVERLAP debug feature was removed in commit af85596c74de ("sched/topology: Remove FORCE_SD_OVERLAP") which left the NUMA domains as the exclusive users of SDTL_OVERLAP, SD_OVERLAP, and SD_NUMA flags. Get rid of SDTL_OVERLAP and SD_OVERLAP as they have become redundant and instead rely on SD_NUMA to detect the only overlapping domain currently supported. Since SDTL_OVERLAP was the only user of "tl->flags", get rid of "sched_domain_topology_level::flags" too. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/ba4dbdf8-bc37-493d-b2e0-2efb00ea3e19@amd.com
2025-07-14x86/smpboot: avoid SMT domain attach/destroy if SMT is not enabledLi Chen
Currently, the SMT domain is added into sched_domain_topology by default. If cpu_attach_domain() finds that the CPU SMT domain’s cpumask_weight is just 1, it will destroy it. On a large machine, such as one with 512 cores, this results in 512 redundant domain attach/destroy operations. Avoid these unnecessary operations by simply checking cpu_smt_num_threads and skip SMT domain if the SMT domain is not enabled. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250710105715.66594-5-me@linux.beauty
2025-07-14x86/smpboot: moves x86_topology to static initialize and truncateLi Chen
The #ifdeffery and the initializers in build_sched_topology() are just disgusting. Statically initialize the domain levels in the topology array and let build_sched_topology() invalidate the package domain level when NUMA in package is available. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250710105715.66594-4-me@linux.beauty
2025-07-14x86/smpboot: remove redundant CONFIG_SCHED_SMTLi Chen
On x86 CONFIG_SCHED_SMT is default y if SMP is enabled, so let's simply drop CONFIG_SCHED_SMT. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250710105715.66594-3-me@linux.beauty
2025-07-14smpboot: introduce SDTL_INIT() helper to tidy sched topology setupLi Chen
Define a small SDTL_INIT(maskfn, flagsfn, name) macro and use it to build the sched_domain_topology_level array. Purely a cleanup; behaviour is unchanged. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250710105715.66594-2-me@linux.beauty
2025-07-14tools/sched: Add dl_bw_dump.py for printing bandwidth accounting infoJuri Lelli
dl_rq bandwidth accounting information is crucial for the correct functioning of SCHED_DEADLINE. Add a drgn script for accessing that information at runtime, so that it's easier to check and debug issues related to it. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b Link: https://lore.kernel.org/r/20250627115118.438797-6-juri.lelli@redhat.com
2025-07-14tools/sched: Add root_domains_dump.py which dumps root domains infoJuri Lelli
Root domains information is somewhat hard to access at runtime. Even with sched_debug and sched_verbose, such information is only printed on kernel console when domains are modified. Add a simple drgn script to more easily retrieve root domains information at runtime. Since tools/sched is a new directory, add it to MAINTAINERS as well. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b Link: https://lore.kernel.org/r/20250627115118.438797-5-juri.lelli@redhat.com
2025-07-14sched/deadline: Fix accounting after global limits changeJuri Lelli
A global limits change (sched_rt_handler() logic) currently leaves stale and/or incorrect values in variables related to accounting (e.g. extra_bw). Properly clean up per runqueue variables before implementing the change and rebuild scheduling domains (so that accounting is also properly restored) after such a change is complete. Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b Link: https://lore.kernel.org/r/20250627115118.438797-4-juri.lelli@redhat.com
2025-07-14sched/deadline: Reset extra_bw to max_bw when clearing root domainsJuri Lelli
dl_clear_root_domain() doesn't take into account the fact that per-rq extra_bw variables retain values computed before root domain changes, resulting in broken accounting. Fix it by resetting extra_bw to max_bw before restoring back dl-servers contributions. Fixes: 2ff899e351643 ("sched/deadline: Rebuild root domain accounting after every update") Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b Link: https://lore.kernel.org/r/20250627115118.438797-3-juri.lelli@redhat.com
2025-07-14sched/deadline: Initialize dl_servers after SMPJuri Lelli
dl-servers are currently initialized too early at boot when CPUs are not fully up (only boot CPU is). This results in miscalculation of per runqueue DEADLINE variables like extra_bw (which needs a stable CPU count). Move initialization of dl-servers later on after SMP has been initialized and CPUs are all online, so that CPU count is stable and DEADLINE variables can be computed correctly. Fixes: d741f297bceaf ("sched/fair: Fair server interface") Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b Link: https://lore.kernel.org/r/20250627115118.438797-2-juri.lelli@redhat.com
2025-07-14sched: Change nr_uninterruptible type to unsigned longAruna Ramakrishna
The commit e6fe3f422be1 ("sched: Make multiple runqueue task counters 32-bit") changed nr_uninterruptible to an unsigned int. But the nr_uninterruptible values for each of the CPU runqueues can grow to large numbers, sometimes exceeding INT_MAX. This is valid, if, over time, a large number of tasks are migrated off of one CPU after going into an uninterruptible state. Only the sum of all nr_interruptible values across all CPUs yields the correct result, as explained in a comment in kernel/sched/loadavg.c. Change the type of nr_uninterruptible back to unsigned long to prevent overflows, and thus the miscalculation of load average. Fixes: e6fe3f422be1 ("sched: Make multiple runqueue task counters 32-bit") Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250709173328.606794-1-aruna.ramakrishna@oracle.com
2025-07-13Linux 6.16-rc6v6.16-rc6Linus Torvalds
2025-07-13Merge tag 'clk-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Fixes for a few clk drivers and bindings: - Add a missing property to the Mediatek MT8188 clk binding to keep binding checks happy - Avoid an OOB by setting the correct number of parents in dispmix_csr_clk_dev_data - Allocate clk_hw structs early in probe to avoid an ordering issue where clk_parent_data points to an unallocated clk_hw when the child clk is registered before the parent clk in the SCMI clk driver * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: dt-bindings: clock: mediatek: Add #reset-cells property for MT8188 clk: imx: Fix an out-of-bounds access in dispmix_csr_clk_dev_data clk: scmi: Handle case where child clocks are initialized before their parents
2025-07-13Merge tag 'x86_urgent_for_v6.16_rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Update Kirill's email address - Allow hugetlb PMD sharing only on 64-bit as it doesn't make a whole lotta sense on 32-bit - Add fixes for a misconfigured AMD Zen2 client which wasn't even supposed to run Linux * tag 'x86_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS: Update Kirill Shutemov's email address for TDX x86/mm: Disable hugetlb page table sharing on 32-bit x86/CPU/AMD: Disable INVLPGB on Zen2 x86/rdrand: Disable RDSEED on AMD Cyan Skillfish
2025-07-13Merge tag 'irq_urgent_for_v6.16_rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Borislav Petkov: - Fix a case of recursive locking in the MSI code - Fix a randconfig build failure in armada-370-xp irqchip * tag 'irq_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/irq-msi-lib: Fix build with PCI disabled PCI/MSI: Prevent recursive locking in pci_msix_write_tph_tag()
2025-07-13Merge tag 'perf_urgent_for_v6.16_rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - Prevent perf_sigtrap() from observing an exiting task and warning about it * tag 'perf_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix WARN in perf_sigtrap()
2025-07-12Merge tag 'mm-hotfixes-stable-2025-07-11-16-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "19 hotfixes. A whopping 16 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. 14 are for MM. Three gdb-script fixes and a kallsyms build fix" * tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: Revert "sched/numa: add statistics of numa balance task" mm: fix the inaccurate memory statistics issue for users mm/damon: fix divide by zero in damon_get_intervals_score() samples/damon: fix damon sample mtier for start failure samples/damon: fix damon sample wsse for start failure samples/damon: fix damon sample prcl for start failure kasan: remove kasan_find_vm_area() to prevent possible deadlock scripts: gdb: vfs: support external dentry names mm/migrate: fix do_pages_stat in compat mode mm/damon/core: handle damon_call_control as normal under kdmond deactivation mm/rmap: fix potential out-of-bounds page table access during batched unmap mm/hugetlb: don't crash when allocating a folio if there are no resv scripts/gdb: de-reference per-CPU MCE interrupts scripts/gdb: fix interrupts.py after maple tree conversion maple_tree: fix mt_destroy_walk() on root leaf node mm/vmalloc: leave lazy MMU mode on PTE mapping error scripts/gdb: fix interrupts display after MCP on x86 lib/alloc_tag: do not acquire non-existent lock in alloc_tag_top_users() kallsyms: fix build without execinfo
2025-07-12Merge tag 'erofs-for-6.16-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "Fix for a cache aliasing issue by adding missing flush_dcache_folio(), which causes execution failures on some arm32 setups. Fix for large compressed fragments, which could be generated by -Eall-fragments option (but should be rare) and was rejected by mistake due to an on-disk hardening commit. The remaining ones are small fixes. Summary: - Address cache aliasing for mappable page cache folios - Allow readdir() to be interrupted - Fix large fragment handling which was errored out by mistake - Add missing tracepoints - Use memcpy_to_folio() to replace copy_to_iter() for inline data" * tag 'erofs-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix large fragment handling erofs: allow readdir() to be interrupted erofs: address D-cache aliasing erofs: use memcpy_to_folio() to replace copy_to_iter() erofs: fix to add missing tracepoint in erofs_read_folio() erofs: fix to add missing tracepoint in erofs_readahead()
2025-07-12Merge tag 'bcachefs-2025-07-11' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet. * tag 'bcachefs-2025-07-11' of git://evilpiepirate.org/bcachefs: bcachefs: Don't set BCH_FS_error on transaction restart bcachefs: Fix additional misalignment in journal space calculations bcachefs: Don't schedule non persistent passes persistently bcachefs: Fix bch2_btree_transactions_read() synchronization bcachefs: btree read retry fixes bcachefs: btree node scan no longer uses btree cache bcachefs: Tweak btree cache helpers for use by btree node scan bcachefs: Fix btree for nonexistent tree depth bcachefs: Fix bch2_io_failures_to_text() bcachefs: bch2_fpunch_snapshot()
2025-07-12Merge tag 'v6.16-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull smb server fixes from Steve French: - fix use after free in lease break - small fix for freeing rdma transport (fixes missing logging of cm_qp_destroy) - fix write count leak * tag 'v6.16-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: fix potential use-after-free in oplock/lease break ack ksmbd: fix a mount write count leak in ksmbd_vfs_kern_path_locked() smb: server: make use of rdma_destroy_qp()
2025-07-11Merge tag 'pci-v6.16-fixes-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull PCI fixes from Bjorn Helgaas: - Track apple Root Ports explicitly and look up the driver data from the struct device instead of using dev->driver_data, which is used by pci_host_common_init() for the generic host bridge pointer (Marc Zyngier) - Set dev->driver_data before pci_host_common_init() calls gen_pci_init() because some drivers need it to set up ECAM mappings; this fixes a regression on MicroChip MPFS Icicle (Geert Uytterhoeven) - Revert the now-unnecessary use of ECAM pci_config_window.priv to store a copy of dev->driver_data (Marc Zyngier) * tag 'pci-v6.16-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: Revert "PCI: ecam: Allow cfg->priv to be pre-populated from the root port device" PCI: host-generic: Set driver_data before calling gen_pci_init() PCI: apple: Add tracking of probed root ports
2025-07-11Merge tag 'drm-fixes-2025-07-12' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds
Pull drm fixes from Simona Vetter: "Cross-subsystem Changes: - agp/amd64 binding dmesg noise regression fix Core Changes: - fix race in gem_handle_create_tail - fixup handle_count fb refcount regression from -rc5, popular with reports ... - call rust dtor for drm_device release Driver Changes: - nouveau: magic 50ms suspend fix, acpi leak fix - tegra: dma api error in nvdec - pvr: fix device reset - habanalbs maintainer update - intel display: fix some dsi mipi sequences - xe fixes: SRIOV fixes, small GuC fixes, disable indirect ring due to issues, compression fix for fragmented BO, doc update * tag 'drm-fixes-2025-07-12' of https://gitlab.freedesktop.org/drm/kernel: (22 commits) drm/xe/guc: Default log level to non-verbose drm/xe/bmg: Don't use WA 16023588340 and 22019338487 on VF drm/xe/guc: Recommend GuC v70.46.2 for BMG, LNL, DG2 drm/xe/pm: Correct comment of xe_pm_set_vram_threshold() drm/xe: Release runtime pm for error path of xe_devcoredump_read() drm/xe/pm: Restore display pm if there is error after display suspend drm/i915/bios: Apply vlv_fixup_mipi_sequences() to v2 mipi-sequences too drm/gem: Fix race in drm_gem_handle_create_tail() drm/framebuffer: Acquire internal references on GEM handles agp/amd64: Check AGP Capability before binding to unsupported devices drm/xe/bmg: fix compressed VRAM handling Revert "drm/xe/xe2: Enable Indirect Ring State support for Xe2" drm/xe: Allocate PF queue size on pow2 boundary drm/xe/pf: Clear all LMTT pages on alloc drm/nouveau/gsp: fix potential leak of memory used during acpi init rust: drm: remove unnecessary imports MAINTAINERS: Change habanalabs maintainer drm/imagination: Fix kernel crash when hard resetting the GPU drm/tegra: nvdec: Fix dma_alloc_coherent error check rust: drm: device: drop_in_place() the drm::Device in release() ...
2025-07-11Revert "eventpoll: Fix priority inversion problem"Linus Torvalds
This reverts commit 8c44dac8add7503c345c0f6c7962e4863b88ba42. I haven't figured out what the actual bug in this commit is, but I did spend a lot of time chasing it down and eventually succeeded in bisecting it down to this. For some reason, this eventpoll commit ends up causing delays and stuck user space processes, but it only happens on one of my machines, and only during early boot or during the flurry of initial activity when logging in. I must be triggering some very subtle timing issue, but once I figured out the behavior pattern that made it reasonably reliable to trigger, it did bisect right to this, and reverting the commit fixes the problem. Of course, that was only after I had failed at bisecting it several times, and had flailed around blaming both the drm people and the netlink people for the odd problems. The most obvious of which happened at the time of the first graphical login (the most common symptom being that some gnome app aborted due to a 30s timeout, often leading to the whole session then failing if it was some critical component like gnome-shell or similar). Acked-by: Nam Cao <namcao@linutronix.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-07-12erofs: fix large fragment handlingGao Xiang
Fragments aren't limited by Z_EROFS_PCLUSTER_MAX_DSIZE. However, if a fragment's logical length is larger than Z_EROFS_PCLUSTER_MAX_DSIZE but the fragment is not the whole inode, it currently returns -EOPNOTSUPP because m_flags has the wrong EROFS_MAP_ENCODED flag set. It is not intended by design but should be rare, as it can only be reproduced by mkfs with `-Eall-fragments` in a specific case. Let's normalize fragment m_flags using the new EROFS_MAP_FRAGMENT. Reported-by: Axel Fontaine <axel@axelfontaine.com> Closes: https://github.com/erofs/erofs-utils/issues/23 Fixes: 7c3ca1838a78 ("erofs: restrict pcluster size limitations") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250711195826.3601157-1-hsiangkao@linux.alibaba.com
2025-07-11Merge tag 'block-6.16-20250710' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - MD changes via Yu: - fix UAF due to stack memory used for bio mempool (Jinchao) - fix raid10/raid1 nowait IO error path (Nigel and Qixing) - fix kernel crash from reading bitmap sysfs entry (Håkon) - Fix for a UAF in the nbd connect error path - Fix for blocksize being bigger than pagesize, if THP isn't enabled * tag 'block-6.16-20250710' of git://git.kernel.dk/linux: block: reject bs > ps block devices when THP is disabled nbd: fix uaf in nbd_genl_connect() error path md/md-bitmap: fix GPF in bitmap_get_stats() md/raid1,raid10: strip REQ_NOWAIT from member bios raid10: cleanup memleak at raid10_make_request md/raid1: Fix stack memory use after return in raid1_reshape
2025-07-11Merge tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Remove a pointless warning in the zcrx code - Fix for MSG_RING commands, where the allocated io_kiocb needs to be freed under RCU as well - Revert the work-around we had in place for the anon inodes pretending to be regular files. Since that got reworked upstream, the work-around is no longer needed * tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linux: Revert "io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well" io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU io_uring/zcrx: fix pp destruction warnings
2025-07-11Merge tag 'net-6.16-rc6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull more networking fixes from Jakub Kicinski "Big chunk of fixes for WiFi, Johannes says probably the last for the release. The Netlink fixes (on top of the tree) restore operation of iw (WiFi CLI) which uses sillily small recv buffer, and is the reason for this 'emergency PR'. The GRE multicast fix also stands out among the user-visible regressions. Current release - fix to a fix: - netlink: make sure we always allow at least one skb to be queued, even if the recvbuf is (mis)configured to be tiny Previous releases - regressions: - gre: fix IPv6 multicast route creation Previous releases - always broken: - wifi: prevent A-MSDU attacks in mesh networks - wifi: cfg80211: fix S1G beacon head validation and detection - wifi: mac80211: - always clear frame buffer to prevent stack leak in cases which hit a WARN() - fix monitor interface in device restart - wifi: mwifiex: discard erroneous disassoc frames on STA interface - wifi: mt76: - prevent null-deref in mt7925_sta_set_decap_offload() - add missing RCU annotations, and fix sleep in atomic - fix decapsulation offload - fixes for scanning - phy: microchip: improve link establishment and reset handling - eth: mlx5e: fix race between DIM disable and net_dim() - bnxt_en: correct DMA unmap len for XDP_REDIRECT" * tag 'net-6.16-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (44 commits) netlink: make sure we allow at least one dump skb netlink: Fix rmem check in netlink_broadcast_deliver(). bnxt_en: Set DMA unmap len correctly for XDP_REDIRECT bnxt_en: Flush FW trace before copying to the coredump bnxt_en: Fix DCB ETS validation net: ll_temac: Fix missing tx_pending check in ethtools_set_ringparam() net/mlx5e: Add new prio for promiscuous mode net/mlx5e: Fix race between DIM disable and net_dim() net/mlx5: Reset bw_share field when changing a node's parent can: m_can: m_can_handle_lost_msg(): downgrade msg lost in rx message to debug level selftests: net: lib: fix shift count out of range selftests: Add IPv6 multicast route generation tests for GRE devices. gre: Fix IPv6 multicast route creation. net: phy: microchip: limit 100M workaround to link-down events on LAN88xx net: phy: microchip: Use genphy_soft_reset() to purge stale LPA bits ibmvnic: Fix hardcoded NUM_RX_STATS/NUM_TX_STATS with dynamic sizeof net: appletalk: Fix device refcount leak in atrtr_create() netfilter: flowtable: account for Ethernet header in nf_flow_pppoe_proto() wifi: mac80211: add the virtual monitor after reconfig complete wifi: mac80211: always initialize sdata::key_list ...
2025-07-11Merge tag 'gpio-fixes-for-v6.16-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio fixes from Bartosz Golaszewski: - fix performance regression when setting values of multiple GPIO lines at once - make sure the GPIO OF xlate code doesn't end up passing an uninitialized local variable to GPIO core - update MAINTAINERS * tag 'gpio-fixes-for-v6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: MAINTAINERS: remove bouncing address for Nandor Han gpio: of: initialize local variable passed to the .of_xlate() callback gpiolib: fix performance regression when using gpio_chip_get_multiple()
2025-07-11Merge tag 'pm-6.16-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "Fix a coding mistake in a previous fix related to system suspend and hibernation merged recently" * tag 'pm-6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: sleep: Call pm_restore_gfp_mask() after dpm_resume()
2025-07-11Merge tag 'dma-mapping-6.16-2025-07-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fix from Marek Szyprowski: - small fix relevant to arm64 server and custom CMA configuration (Feng Tang) * tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-contiguous: hornor the cma address limit setup by user
2025-07-11netlink: make sure we allow at least one dump skbJakub Kicinski
Commit under Fixes tightened up the memory accounting for Netlink sockets. Looks like the accounting is too strict for some existing use cases, Marek reported issues with nl80211 / WiFi iw CLI. To reduce number of iterations Netlink dumps try to allocate messages based on the size of the buffer passed to previous recvmsg() calls. If user space uses a larger buffer in recvmsg() than sk_rcvbuf we will allocate an skb we won't be able to queue. Make sure we always allow at least one skb to be queued. Same workaround is already present in netlink_attachskb(). Alternative would be to cap the allocation size to rcvbuf - rmem_alloc but as I said, the workaround is already present in other places. Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/9794af18-4905-46c6-b12c-365ea2f05858@samsung.com Fixes: ae8f160e7eb2 ("netlink: Fix wraparounds of sk->sk_rmem_alloc.") Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250711001121.3649033-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11netlink: Fix rmem check in netlink_broadcast_deliver().Kuniyuki Iwashima
We need to allow queuing at least one skb even when skb is larger than sk->sk_rcvbuf. The cited commit made a mistake while converting a condition in netlink_broadcast_deliver(). Let's correct the rmem check for the allow-one-skb rule. Fixes: ae8f160e7eb24 ("netlink: Fix wraparounds of sk->sk_rmem_alloc.") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250711053208.2965945-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11Merge branch 'bnxt_en-3-bug-fixes'Jakub Kicinski
Michael Chan says: ==================== bnxt_en: 3 bug fixes The first one fixes a possible failure when setting DCB ETS. The second one fixes the ethtool coredump (-W 2) not containing all the FW traces. The third one fixes the DMA unmap length when transmitting XDP_REDIRECT packets. ==================== Link: https://patch.msgid.link/20250710213938.1959625-1-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11bnxt_en: Set DMA unmap len correctly for XDP_REDIRECTSomnath Kotur
When transmitting an XDP_REDIRECT packet, call dma_unmap_len_set() with the proper length instead of 0. This bug triggers this warning on a system with IOMMU enabled: WARNING: CPU: 36 PID: 0 at drivers/iommu/dma-iommu.c:842 __iommu_dma_unmap+0x159/0x170 RIP: 0010:__iommu_dma_unmap+0x159/0x170 Code: a8 00 00 00 00 48 c7 45 b0 00 00 00 00 48 c7 45 c8 00 00 00 00 48 c7 45 a0 ff ff ff ff 4c 89 45 b8 4c 89 45 c0 e9 77 ff ff ff <0f> 0b e9 60 ff ff ff e8 8b bf 6a 00 66 66 2e 0f 1f 84 00 00 00 00 RSP: 0018:ff22d31181150c88 EFLAGS: 00010206 RAX: 0000000000002000 RBX: 00000000e13a0000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ff22d31181150cf0 R08: ff22d31181150ca8 R09: 0000000000000000 R10: 0000000000000000 R11: ff22d311d36c9d80 R12: 0000000000001000 R13: ff13544d10645010 R14: ff22d31181150c90 R15: ff13544d0b2bac00 FS: 0000000000000000(0000) GS:ff13550908a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005be909dacff8 CR3: 0008000173408003 CR4: 0000000000f71ef0 PKRU: 55555554 Call Trace: <IRQ> ? show_regs+0x6d/0x80 ? __warn+0x89/0x160 ? __iommu_dma_unmap+0x159/0x170 ? report_bug+0x17e/0x1b0 ? handle_bug+0x46/0x90 ? exc_invalid_op+0x18/0x80 ? asm_exc_invalid_op+0x1b/0x20 ? __iommu_dma_unmap+0x159/0x170 ? __iommu_dma_unmap+0xb3/0x170 iommu_dma_unmap_page+0x4f/0x100 dma_unmap_page_attrs+0x52/0x220 ? srso_alias_return_thunk+0x5/0xfbef5 ? xdp_return_frame+0x2e/0xd0 bnxt_tx_int_xdp+0xdf/0x440 [bnxt_en] __bnxt_poll_work_done+0x81/0x1e0 [bnxt_en] bnxt_poll+0xd3/0x1e0 [bnxt_en] Fixes: f18c2b77b2e4 ("bnxt_en: optimized XDP_REDIRECT support") Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250710213938.1959625-4-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11bnxt_en: Flush FW trace before copying to the coredumpShruti Parab
bnxt_fill_drv_seg_record() calls bnxt_dbg_hwrm_log_buffer_flush() to flush the FW trace buffer. This needs to be done before we call bnxt_copy_ctx_mem() to copy the trace data. Without this fix, the coredump may not contain all the FW traces. Fixes: 3c2179e66355 ("bnxt_en: Add FW trace coredump segments to the coredump") Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Shruti Parab <shruti.parab@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250710213938.1959625-3-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11bnxt_en: Fix DCB ETS validationShravya KN
In bnxt_ets_validate(), the code incorrectly loops over all possible traffic classes to check and add the ETS settings. Fix it to loop over the configured traffic classes only. The unconfigured traffic classes will default to TSA_ETS with 0 bandwidth. Looping over these unconfigured traffic classes may cause the validation to fail and trigger this error message: "rejecting ETS config starving a TC\n" The .ieee_setets() will then fail. Fixes: 7df4ae9fe855 ("bnxt_en: Implement DCBNL to support host-based DCBX.") Reviewed-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com> Signed-off-by: Shravya KN <shravya.k-n@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250710213938.1959625-2-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net: ll_temac: Fix missing tx_pending check in ethtools_set_ringparam()Alok Tiwari
The function ll_temac_ethtools_set_ringparam() incorrectly checked rx_pending twice, once correctly for RX and once mistakenly in place of tx_pending. This caused tx_pending to be left unchecked against TX_BD_NUM_MAX. As a result, invalid TX ring sizes may have been accepted or valid ones wrongly rejected based on the RX limit, leading to potential misconfiguration or unexpected results. This patch corrects the condition to properly validate tx_pending. Fixes: f7b261bfc35e ("net: ll_temac: Make RX/TX ring sizes configurable") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Link: https://patch.msgid.link/20250710180621.2383000-1-alok.a.tiwari@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11Merge branch 'mlx5-misc-fixes-2025-07-10'Jakub Kicinski
Tariq Toukan says: ==================== mlx5 misc fixes 2025-07-10 This small patchset provides misc bug fixes from the team to the mlx5 core and EN drivers. ==================== Link: https://patch.msgid.link/1752155624-24095-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net/mlx5e: Add new prio for promiscuous modeJianbo Liu
An optimization for promiscuous mode adds a high-priority steering table with a single catch-all rule to steer all traffic directly to the TTC table. However, a gap exists between the creation of this table and the insertion of the catch-all rule. Packets arriving in this brief window would miss as no rule was inserted yet, unnecessarily incrementing the 'rx_steer_missed_packets' counter and dropped. This patch resolves the issue by introducing a new prio for this table, placing it between MLX5E_TC_PRIO and MLX5E_NIC_PRIO. By doing so, packets arriving during the window now fall through to the next prio (at MLX5E_NIC_PRIO) instead of being dropped. Fixes: 1c46d7409f30 ("net/mlx5e: Optimize promiscuous mode") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1752155624-24095-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net/mlx5e: Fix race between DIM disable and net_dim()Carolina Jubran
There's a race between disabling DIM and NAPI callbacks using the dim pointer on the RQ or SQ. If NAPI checks the DIM state bit and sees it still set, it assumes `rq->dim` or `sq->dim` is valid. But if DIM gets disabled right after that check, the pointer might already be set to NULL, leading to a NULL pointer dereference in net_dim(). Fix this by calling `synchronize_net()` before freeing the DIM context. This ensures all in-progress NAPI callbacks are finished before the pointer is cleared. Kernel log: BUG: kernel NULL pointer dereference, address: 0000000000000000 ... RIP: 0010:net_dim+0x23/0x190 ... Call Trace: <TASK> ? __die+0x20/0x60 ? page_fault_oops+0x150/0x3e0 ? common_interrupt+0xf/0xa0 ? sysvec_call_function_single+0xb/0x90 ? exc_page_fault+0x74/0x130 ? asm_exc_page_fault+0x22/0x30 ? net_dim+0x23/0x190 ? mlx5e_poll_ico_cq+0x41/0x6f0 [mlx5_core] ? sysvec_apic_timer_interrupt+0xb/0x90 mlx5e_handle_rx_dim+0x92/0xd0 [mlx5_core] mlx5e_napi_poll+0x2cd/0xac0 [mlx5_core] ? mlx5e_poll_ico_cq+0xe5/0x6f0 [mlx5_core] busy_poll_stop+0xa2/0x200 ? mlx5e_napi_poll+0x1d9/0xac0 [mlx5_core] ? mlx5e_trigger_irq+0x130/0x130 [mlx5_core] __napi_busy_loop+0x345/0x3b0 ? sysvec_call_function_single+0xb/0x90 ? asm_sysvec_call_function_single+0x16/0x20 ? sysvec_apic_timer_interrupt+0xb/0x90 ? pcpu_free_area+0x1e4/0x2e0 napi_busy_loop+0x11/0x20 xsk_recvmsg+0x10c/0x130 sock_recvmsg+0x44/0x70 __sys_recvfrom+0xbc/0x130 ? __schedule+0x398/0x890 __x64_sys_recvfrom+0x20/0x30 do_syscall_64+0x4c/0x100 entry_SYSCALL_64_after_hwframe+0x4b/0x53 ... ---[ end trace 0000000000000000 ]--- ... ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Fixes: 445a25f6e1a2 ("net/mlx5e: Support updating coalescing configuration without resetting channels") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1752155624-24095-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net/mlx5: Reset bw_share field when changing a node's parentCarolina Jubran
When changing a node's parent, its scheduling element is destroyed and re-created with bw_share 0. However, the node's bw_share field was not updated accordingly. Set the node's bw_share to 0 after re-creation to keep the software state in sync with the firmware configuration. Fixes: 9c7bbf4c3304 ("net/mlx5: Add support for setting parent of nodes") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1752155624-24095-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11MAINTAINERS: Update Kirill Shutemov's email address for TDXKirill A. Shutemov
Update MAINTAINERS to use my @kernel.org email address. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20250708101922.50560-4-kirill.shutemov%40linux.intel.com