summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-08-02mm/damon/vaddr: skip isolating folios already in destination nidBijan Tabatabai
damos_va_migrate_dests_add() determines the node a folio should be in based on the struct damos_migrate_dests associated with the migration scheme and adds the folio to the linked list corresponding to that node so it can be migrated later. Currently, folios are isolated and added to the list even if they are already in the node they should be in. In using damon weighted interleave more, I've found that the overhead of needlessly adding these folios to the migration lists can be quite high. The overhead comes from isolating folios and placing them in the migration lists inside of damos_va_migrate_dests_add(), as well as the cost of handling those folios in damon_migrate_pages(). This patch eliminates that overhead by simply avoiding the addition of folios that are already in their intended location to the migration list. To show the benefit of this patch, we start the test workload and start a DAMON instance attached to that workload with a migrate_hot scheme that has one dest field sending data to the local node. This way, we are only measuring the overheads of the scheme, and not the cost of migrating pages, since data will be allocated to the local node by default. I tested with two workloads: the embedding reduction workload used in [1] and a microbenchmark that allocates 20GB of data then sleeps, which is similar to the memory usage of the embedding reduction workload. The time taken in damos_va_migrate_dests_add() and damon_migrate_pages() each aggregation interval is shown below. Before this patch: damos_va_migrate_dests_add damon_migrate_pages microbenchmark ~2ms ~3ms embedding reduction ~1s ~3s After this patch: damos_va_migrate_dests_add damon_migrate_pages microbenchmark 0us ~40us embedding reduction 0us ~100us I did not do an in depth analysis for why things are much slower in the embedding reduction workload than the microbenchmark. However, I assume it's because the embedding reduction workload oversaturates the bandwidth of the local memory node, increasing the memory access latency, and in turn making the pointer chasing involved in iterating through a linked list much slower. Regardless of that, this patch results in a significant speedup. [1] https://lore.kernel.org/damon/20250709005952.17776-1-bijan311@gmail.com/ Link: https://lkml.kernel.org/r/20250725163300.4602-1-bijan311@gmail.com Fixes: 19c1dc15c859 ("mm/damon/vaddr: use damos->migrate_dests in migrate_{hot,cold}") Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02selftests: cachestat: add tests for mmap, refactor and enhance mmap test for ↵Suresh K C
cachestat validation Add a cohesive test case that verifies cachestat behavior with memory-mapped files using mmap(). Also refactor the test logic to reduce redundancy, improve error reporting, and clarify failure messages for both shmem and mmap file types. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20250709174657.6916-1-suresh.k.chandrappa@gmail.com Signed-off-by: Suresh K C <suresh.k.chandrappa@gmail.com> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Tested-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02mm: add process info to bad rss-counter warningXuanye Liu
Enhance the debugging information in check_mm() by including the process name and PID when reporting bad rss-counter states. This helps identify which process is associated with the memory accounting issue. Link: https://lkml.kernel.org/r/20250723100901.1909683-1-liuqiye2025@163.com Signed-off-by: Xuanye Liu <liuqiye2025@163.com> Acked-by: SeongJae Park <sj@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02kasan: skip quarantine if object is still accessible under RCUJann Horn
Currently, enabling KASAN masks bugs where a lockless lookup path gets a pointer to a SLAB_TYPESAFE_BY_RCU object that might concurrently be recycled and is insufficiently careful about handling recycled objects: KASAN puts freed objects in SLAB_TYPESAFE_BY_RCU slabs onto its quarantine queues, even when it can't actually detect UAF in these objects, and the quarantine prevents fast recycling. When I introduced CONFIG_SLUB_RCU_DEBUG, my intention was that enabling CONFIG_SLUB_RCU_DEBUG should cause KASAN to mark such objects as freed after an RCU grace period and put them on the quarantine, while disabling CONFIG_SLUB_RCU_DEBUG should allow such objects to be reused immediately; but that hasn't actually been working. I discovered such a UAF bug involving SLAB_TYPESAFE_BY_RCU yesterday; I could only trigger this bug in a KASAN build by disabling CONFIG_SLUB_RCU_DEBUG and applying this patch. Link: https://lkml.kernel.org/r/20250723-kasan-tsbrcu-noquarantine-v1-1-846c8645976c@google.com Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Alexander Potapenko <glider@google.com> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02mm/page-flags: remove folio_start_writeback_keepwrite()Joanne Koong
Commit cd57b77197a4 ("ext4: Convert ext4_bio_write_page() to use a folio) removed set_page_writeback_keepwrite() which was the last/only caller of folio_start_writeback_keepwrite(). Link: https://lkml.kernel.org/r/20250722182230.2114587-1-joannelkoong@gmail.com Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02selftests/mm: add process_madvise() testswang lian
Add tests for process_madvise(), focusing on verifying behavior under various conditions including valid usage and error cases. [lianux.mm@gmail.com: v7] Link: https://lkml.kernel.org/r/20250729113109.12272-1-lianux.mm@gmail.com Link: https://lkml.kernel.org/r/20250729113109.12272-1-lianux.mm@gmail.com Link: https://lkml.kernel.org/r/20250721114614.40996-1-lianux.mm@gmail.com Signed-off-by: wang lian <lianux.mm@gmail.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Zi Yan <ziy@nvidia.com> Suggested-by: Mark Brown <broonie@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Tested-by: Zi Yan <ziy@nvidia.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02mm: shmem: fix the shmem large folio allocation for the i915 driverBaolin Wang
After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"), we extend the 'huge=' option to allow any sized large folios for tmpfs, which means tmpfs will allow getting a highest order hint based on the size of write() and fallocate() paths, and then will try each allowable large order. However, when the i915 driver allocates shmem memory, it doesn't provide hint information about the size of the large folio to be allocated, resulting in the inability to allocate PMD-sized shmem, which in turn affects GPU performance. Patryk added: : In my tests, the performance drop ranges from a few percent up to 13% : in Unigine Superposition under heavy memory usage on the CPU Core Ultra : 155H with the Xe 128 EU GPU. Other users have reported performance : impact up to 30% on certain workloads. Please find more in the : regressions reports: : https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14645 : https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13845 : : I believe the change should be backported to all active kernel branches : after version 6.12. To fix this issue, we can use the inode's size as a write size hint in shmem_read_folio_gfp() to help allocate PMD-sized large folios. Link: https://lkml.kernel.org/r/f7e64e99a3a87a8144cc6b2f1dddf7a89c12ce44.1753926601.git.baolin.wang@linux.alibaba.com Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Patryk Kowalczyk <patryk@kowalczyk.ws> Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Tested-by: Patryk Kowalczyk <patryk@kowalczyk.ws> Suggested-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02tools/getdelays: add backward compatibility for taskstats versionFan Yu
Add version checks to print_delayacct() to handle differences in struct taskstats across kernel versions. Field availability depends on taskstats version (t->version), corresponding to TASKSTATS_VERSION in kernel headers (see include/uapi/linux/taskstats.h). Version feature mapping: - version >= 11 - supports COMPACT statistics - version >= 13 - supports WPCOPY statistics - version >= 14 - supports IRQ statistics - version >= 16 - supports *_max and *_min delay statistics This ensures the tool works correctly with both older and newer kernel versions by conditionally printing fields based on the reported version. eg.1 bash# grep -r "#define TASKSTATS_VERSION" /usr/include/linux/taskstats.h "#define TASKSTATS_VERSION 10" bash# ./getdelays -d -p 1 CPU count real total virtual total delay total delay average 7481 3786181709 3807098291 36393725 0.005ms IO count delay total delay average 369 1116046035 3.025ms SWAP count delay total delay average 0 0 0.000ms RECLAIM count delay total delay average 0 0 0.000ms THRASHING count delay total delay average 0 0 0.000ms eg.2 bash# grep -r "#define TASKSTATS_VERSION" /usr/include/linux/taskstats.h "#define TASKSTATS_VERSION 14" bash# ./getdelays -d -p 1 CPU count real total virtual total delay total delay average 68862 163474790046 174584722267 19962496806 0.290ms IO count delay total delay average 0 0 0.000ms SWAP count delay total delay average 0 0 0.000ms RECLAIM count delay total delay average 0 0 0.000ms THRASHING count delay total delay average 0 0 0.000ms COMPACT count delay total delay average 0 0 0.000ms WPCOPY count delay total delay average 0 0 0.000ms IRQ count delay total delay average 0 0 0.000ms Link: https://lkml.kernel.org/r/20250731225326549CttJ7g9NfjTlaqBwl015T@zte.com.cn Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Wang Yaxin <wang.yaxin@zte.com.cn> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02kho: add test for kexec handoverMike Rapoport (Microsoft)
Testing kexec handover requires a kernel driver that will generate some data and preserve it with KHO on the first boot and then restore that data and verify it was preserved properly after kexec. To facilitate such test, along with the kernel driver responsible for data generation, preservation and restoration add a script that runs a kernel in a VM with a minimal /init. The /init enables KHO, loads a kernel image for kexec and runs kexec reboot. After the boot of the kexeced kernel, the driver verifies that the data was properly preserved. [rppt@kernel.org: fix section mismatch] Link: https://lkml.kernel.org/r/aIiRC8fXiOXKbPM_@kernel.org Link: https://lkml.kernel.org/r/20250727083733.2590139-1-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02delaytop: enhance error logging and add PSI feature descriptionfan.yu9@zte.com.cn
This patch improves error diagnostics and documentation for delaytop: 1) Enhanced error logging: - Added explicit error messages in critical failure paths - Implemented BOOL_FPRINT macro for robust output handling 2) PSI feature documentation: - Updated header comment to reflect PSI monitoring capability - Improved output formatting for PSI information System Pressure Information: (avg10/avg60/avg300/total) CPU some: 0.0%/ 0.0%/ 0.0%/ 345(ms) CPU full: 0.0%/ 0.0%/ 0.0%/ 0(ms) Memory full: 0.0%/ 0.0%/ 0.0%/ 0(ms) Memory some: 0.0%/ 0.0%/ 0.0%/ 0(ms) IO full: 0.0%/ 0.0%/ 0.0%/ 65(ms) IO some: 0.0%/ 0.0%/ 0.0%/ 79(ms) IRQ full: 0.0%/ 0.0%/ 0.0%/ 0(ms) Link: https://lkml.kernel.org/r/202507281628341752gMXCMN7S-Vz_LHYHum9r@zte.com.cn Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Acked-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02samples: Kconfig: fix spelling mistake "instancess" -> "instances"Colin Ian King
There is a spelling mistake in the SAMPLE_TRACE_ARRAY config. Fix it. Link: https://lkml.kernel.org/r/20250724111715.141826-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02fat: fix too many log in fat_chain_add()OGAWA Hirofumi
This log was excessive for a serial console. So use the ratelimited version instead. Link: https://lkml.kernel.org/r/87qzy611d9.fsf@mail.parknet.co.jp Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reported-by: syzbot+fa7ef54f66c189c04b73@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=fa7ef54f66c189c04b73 Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02scripts/spelling.txt: add notifer||notifier to spelling.txtWangYuli
This typo was not listed in scripts/spelling.txt, thus it was more difficult to detect. Add it for convenience. Link: https://lkml.kernel.org/r/02153C05ED7B49B7+20250722073431.21983-8-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02xen/xenbus: fix typo "notifer"WangYuli
There is a spelling mistake of 'notifer' in the comment which should be 'notifier'. Link: https://lkml.kernel.org/r/C6633C66376C709A+20250722073431.21983-7-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02net: mvneta: fix typo "notifer"WangYuli
There is a spelling mistake of 'notifer' in the comment which should be 'notifier'. Link: https://lkml.kernel.org/r/0CB4300CB6F49007+20250722073431.21983-4-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02drm/xe: fix typo "notifer"WangYuli
There is a spelling mistake of 'notifer' in the comment which should be 'notifier'. Link: https://lkml.kernel.org/r/94190C5F54A19F3E+20250722073431.21983-3-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02cxl: mce: fix typo "notifer"WangYuli
According to the context, "mce_notifer" should be "mce_notifier". Link: https://lkml.kernel.org/r/E1EB1BA9FDF07D53+20250722073431.21983-2-wangyuli@uniontech.com Fixes: 516e5bd0b6bf ("cxl: Add mce notifier to emit aliased address for extended linear cache") Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02KVM: x86: fix typo "notifer"WangYuli
Patch series "treewide: Fix typo "notifer"", v3. There are some spelling mistakes of 'notifer' in comments which should be 'notifier'. Fix them and add it to scripts/spelling.txt. This patch (of 8): There are some spelling mistakes of 'notifer' which should be 'notifier'. Link: https://lkml.kernel.org/r/576F0D85F6853074+20250722072734.19367-1-wangyuli@uniontech.com Link: https://lkml.kernel.org/r/7F05778C3A1A9F8B+20250722073431.21983-1-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02MAINTAINERS: add maintainers for delaytopWang Yaxin
The delaytop tool supports showing system delays and task-level delays, effectively identifying the top-n tasks with high latency in the system, which is highly beneficial for improving system performance. Wang Yaxin and her colleague Fan Yu focus on locating system delay issues. To promote the thriving development of delaytop, we hope to serve as maintainers to continuously improve it, aiming to provide a more effective solution for system latency issues in the future. Link: https://lkml.kernel.org/r/20250721094049958ImB8XG_imntcPqpQn1KfG@zte.com.cn Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()Uros Bizjak
Use atomic_long_try_cmpxchg() instead of atomic_long_cmpxchg (*ptr, old, new) == old in atomic_long_inc_below(). x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, atomic_long_try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails, enabling further code simplifications. No functional change intended. Link: https://lkml.kernel.org/r/20250721174610.28361-2-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Alexey Gladkov <legion@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Alexey Gladkov <legion@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: MengEn Sun <mengensun@tencent.com> Cc: "Thomas Weißschuh" <linux@weissschuh.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02ucount: fix atomic_long_inc_below() argument typeUros Bizjak
The type of u argument of atomic_long_inc_below() should be long to avoid unwanted truncation to int. The patch fixes the wrong argument type of an internal function to prevent unwanted argument truncation. It fixes an internal locking primitive; it should not have any direct effect on userspace. Mark said : AFAICT there's no problem in practice because atomic_long_inc_below() : is only used by inc_ucount(), and it looks like the value is : constrained between 0 and INT_MAX. : : In inc_ucount() the limit value is taken from : user_namespace::ucount_max[], and AFAICT that's only written by : sysctls, to the table setup by setup_userns_sysctls(), where : UCOUNT_ENTRY() limits the value between 0 and INT_MAX. : : This is certainly a cleanup, but there might be no functional issue in : practice as above. Link: https://lkml.kernel.org/r/20250721174610.28361-1-ubizjak@gmail.com Fixes: f9c82a4ea89c ("Increase size of ucounts to atomic_long_t") Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Alexey Gladkov <legion@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: MengEn Sun <mengensun@tencent.com> Cc: "Thomas Weißschuh" <linux@weissschuh.net> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02kexec: enable CMA based contiguous allocationAlexander Graf
When booting a new kernel with kexec_file, the kernel picks a target location that the kernel should live at, then allocates random pages, checks whether any of those patches magically happens to coincide with a target address range and if so, uses them for that range. For every page allocated this way, it then creates a page list that the relocation code - code that executes while all CPUs are off and we are just about to jump into the new kernel - copies to their final memory location. We can not put them there before, because chances are pretty good that at least some page in the target range is already in use by the currently running Linux environment. Copying is happening from a single CPU at RAM rate, which takes around 4-50 ms per 100 MiB. All of this is inefficient and error prone. To successfully kexec, we need to quiesce all devices of the outgoing kernel so they don't scribble over the new kernel's memory. We have seen cases where that does not happen properly (*cough* GIC *cough*) and hence the new kernel was corrupted. This started a month long journey to root cause failing kexecs to eventually see memory corruption, because the new kernel was corrupted severely enough that it could not emit output to tell us about the fact that it was corrupted. By allocating memory for the next kernel from a memory range that is guaranteed scribbling free, we can boot the next kernel up to a point where it is at least able to detect corruption and maybe even stop it before it becomes severe. This increases the chance for successful kexecs. Since kexec got introduced, Linux has gained the CMA framework which can perform physically contiguous memory mappings, while keeping that memory available for movable memory when it is not needed for contiguous allocations. The default CMA allocator is for DMA allocations. This patch adds logic to the kexec file loader to attempt to place the target payload at a location allocated from CMA. If successful, it uses that memory range directly instead of creating copy instructions during the hot phase. To ensure that there is a safety net in case anything goes wrong with the CMA allocation, it also adds a flag for user space to force disable CMA allocations. Using CMA allocations has two advantages: 1) Faster by 4-50 ms per 100 MiB. There is no more need to copy in the hot phase. 2) More robust. Even if by accident some page is still in use for DMA, the new kernel image will be safe from that access because it resides in a memory region that is considered allocated in the old kernel and has a chance to reinitialize that component. Link: https://lkml.kernel.org/r/20250610085327.51817-1-graf@amazon.com Signed-off-by: Alexander Graf <graf@amazon.com> Acked-by: Baoquan He <bhe@redhat.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Zhongkun He <hezhongkun.hzk@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02stackdepot: make max number of pools boot-time configurableMatt Fleming
We're hitting the WARN in depot_init_pool() about reaching the stack depot limit because we have long stacks that don't dedup very well. Introduce a new start-up parameter to allow users to set the number of maximum stack depot pools. Link: https://lkml.kernel.org/r/20250718153928.94229-1-matt@readmodwrite.com Signed-off-by: Matt Fleming <mfleming@cloudflare.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02lib/xxhash: remove unused functionsDr. David Alan Gilbert
xxh32_digest() and xxh32_update() were added in 2017 in the original xxhash commit, but have remained unused. Remove them. Link: https://lkml.kernel.org/r/20250716133245.243363-1-linux@treblig.org Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Dave Gilbert <linux@treblig.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02init/Kconfig: restore CONFIG_BROKEN help textAndrew Morton
Linus added it in 2003, it later was removed. Put it back. Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Betkov <bp@alien8.de> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02mm/shmem, swap: improve cached mTHP handling and fix potential hangKairui Song
The current swap-in code assumes that, when a swap entry in shmem mapping is order 0, its cached folios (if present) must be order 0 too, which turns out not always correct. The problem is shmem_split_large_entry is called before verifying the folio will eventually be swapped in, one possible race is: CPU1 CPU2 shmem_swapin_folio /* swap in of order > 0 swap entry S1 */ folio = swap_cache_get_folio /* folio = NULL */ order = xa_get_order /* order > 0 */ folio = shmem_swap_alloc_folio /* mTHP alloc failure, folio = NULL */ <... Interrupted ...> shmem_swapin_folio /* S1 is swapped in */ shmem_writeout /* S1 is swapped out, folio cached */ shmem_split_large_entry(..., S1) /* S1 is split, but the folio covering it has order > 0 now */ Now any following swapin of S1 will hang: `xa_get_order` returns 0, and folio lookup will return a folio with order > 0. The `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` will always return false causing swap-in to return -EEXIST. And this looks fragile. So fix this up by allowing seeing a larger folio in swap cache, and check the whole shmem mapping range covered by the swapin have the right swap value upon inserting the folio. And drop the redundant tree walks before the insertion. This will actually improve performance, as it avoids two redundant Xarray tree walks in the hot path, and the only side effect is that in the failure path, shmem may redundantly reallocate a few folios causing temporary slight memory pressure. And worth noting, it may seems the order and value check before inserting might help reducing the lock contention, which is not true. The swap cache layer ensures raced swapin will either see a swap cache folio or failed to do a swapin (we have SWAP_HAS_CACHE bit even if swap cache is bypassed), so holding the folio lock and checking the folio flag is already good enough for avoiding the lock contention. The chance that a folio passes the swap entry value check but the shmem mapping slot has changed should be very low. Link: https://lkml.kernel.org/r/20250728075306.12704-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250728075306.12704-2-ryncsn@gmail.com Fixes: 809bc86517cc ("mm: shmem: support large folio swap out") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Dev Jain <dev.jain@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02Merge tag 'fbdev-for-6.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev Pull fbdev updates from Helge Deller: "One potential buffer overflow fix in the framebuffer registration function, some fixes for the imxfb, nvidiafb and simplefb drivers, and a bunch of cleanups for fbcon, kyrofb and svgalib. Framework fixes: - fix potential buffer overflow in do_register_framebuffer() [Yongzhen Zhang] Driver fixes: - imxfb: prevent null-ptr-deref [Chenyuan Yang] - nvidiafb: fix build on 32-bit ARCH=um [Johannes Berg] - nvidiafb: add depends on HAS_IOPORT [Randy Dunlap] - simplefb: Use of_reserved_mem_region_to_resource() for "memory-region" [Rob Herring] Cleanups: - fbcon: various code cleanups wrt blinking [Ville Syrjälä] - kyrofb: Convert to devm_*() functions [Giovanni Di Santi] - svgalib: Coding style cleanups [Darshan R.] - Fix typo in Kconfig text for FB_DEVICE [Daniel Palmer]" * tag 'fbdev-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev: fbcon: Use 'bool' where appopriate fbcon: Introduce get_{fg,bg}_color() fbcon: fbcon_is_inactive() -> fbcon_is_active() fbcon: fbcon_cursor_noblink -> fbcon_cursor_blink fbdev: Fix typo in Kconfig text for FB_DEVICE fbdev: imxfb: Check fb_add_videomode to prevent null-ptr-deref fbdev: svgalib: Clean up coding style fbdev: kyro: Use devm_ioremap_wc() for screen mem fbdev: kyro: Use devm_ioremap() for mmio registers fbdev: kyro: Add missing PCI memory region request fbdev: simplefb: Use of_reserved_mem_region_to_resource() for "memory-region" fbdev: fix potential buffer overflow in do_register_framebuffer() fbdev: nvidiafb: add depends on HAS_IOPORT fbdev: nvidiafb: fix build on 32-bit ARCH=um
2025-08-02Merge tag 'firewire-updates-6.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394 Pull firewire updates from Takashi Sakamoto: "This update replaces the remaining tasklet usage in the FireWire subsystem with workqueue for asynchronous packet transmission. With this change, tasklets are now fully eliminated from the subsystem. Asynchronous packet transmission is used for serial bus topology management as well as for the operation of the SBP-2 protocol driver (firewire-sbp2). To ensure reliability during low-memory conditions, the associated workqueue is created with the WQ_MEM_RECLAIM flag, allowing it to participate in memory reclaim paths. Other attributes are aligned with those used for isochronous packet handling, which was migrated to workqueues in v6.12. The workqueues are sleepable and support preemptible work items, making them more suitable for real-time workloads that benefit from timely task preemption at the system level. There remains an issue where 'schedule()' may be called within an RCU read-side critical section, due to a direct replacement of 'tasklet_disable_in_atomic()' with 'disable_work_sync()'. A proposed fix for this has been posted[1], and is currently under review and testing. It is expected to be sent upstream later" Link: https://lore.kernel.org/lkml/20250728015125.17825-1-o-takashi@sakamocchi.jp/ [1] * tag 'firewire-updates-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394: firewire: ohci: reduce the size of common context structure by extracting members into AT structure firewire: core: minor code refactoring to localize table of gap count firewire: ohci: use workqueue to handle events of AT request/response contexts firewire: ohci: use workqueue to handle events of AR request/response contexts firewire: core: allocate workqueue for AR/AT request/response contexts firewire: core: use from_work() macro to expand parent structure of work_struct firewire: ohci: use from_work() macro to expand parent structure of work_struct firewire: ohci: correct code comments about bus_reset tasklet
2025-08-02Merge tag 'bpf-next-6.17' into loongarch-nextHuacai Chen
LoongArch architecture changes for 6.17 have many bpf features such as trampoline, so merge 'bpf-next-6.17' to create a base to make bpf work well.
2025-08-01Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Alexei Starovoitov: - Fix kCFI failures in JITed BPF code on arm64 (Sami Tolvanen, Puranjay Mohan, Mark Rutland, Maxwell Bland) - Disallow tail calls between BPF programs that use different cgroup local storage maps to prevent out-of-bounds access (Daniel Borkmann) - Fix unaligned access in flow_dissector and netfilter BPF programs (Paul Chaignon) - Avoid possible use of uninitialized mod_len in libbpf (Achill Gilgenast) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Test for unaligned flow_dissector ctx access bpf: Improve ctx access verifier error message bpf: Check netfilter ctx accesses are aligned bpf: Check flow_dissector ctx accesses are aligned arm64/cfi,bpf: Support kCFI + BPF on arm64 cfi: Move BPF CFI types and helpers to generic code cfi: add C CFI type macro libbpf: Avoid possible use of uninitialized mod_len bpf: Fix oob access in cgroup local storage bpf: Move cgroup iterator helpers to bpf.h bpf: Move bpf map owner out of common struct bpf: Add cookie object to bpf maps
2025-08-01Merge tag 'perf-tools-for-v6.17-2025-08-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools Pull perf tools updates from Namhyung Kim: "Build-ID processing goodies: Build-IDs are content based hashes to link regions of memory to ELF files in post processing. They have been available in distros for quite a while: $ file /bin/bash /bin/bash: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=707a1c670cd72f8e55ffedfbe94ea98901b7ce3a, for GNU/Linux 3.2.0, stripped It is possible to ask the kernel to get it from mmap executable backing storage at time they are being put in place and send it as metadata at that moment to have in perf.data. Prefer that across the board to speed up 'record' time - it post processes the samples to find binaries touched by any samples and to save them with build-ID. It can skip reading build-ID in userspace if it comes from the kernel. perf record: * Make --buildid-mmap default. The kernel can generate MMAP2 events with a build-ID from ELF header. Use that by default instead of using inode and device ID to identify binaries. It also can be disabled with --no-buildid-mmap. * Use BPF for -u/--uid option to sample processes belong to a user. BPF can track user processes more accurately and the existing logic often fails to get the list of processes due to race with reading the /proc filesystem. * Generate PERF_RECORD_BPF_METADATA when it profiles BPF programs and they have variables starting with "bpf_metadata_". This will help to identify BPF objects used in the profile. This has been supported in bpftool for some time and allows the recording of metadata such as commit hashes, versions, etc, that now gets recorded in perf.data as well. * Collect list of DSOs touched in the sample callchains as well as in the sample itself. This would increase the processing time at the end of record, but can improve the data quality. perf stat: * Add a new 'drm' pseudo-PMU support like in 'hwmon'. It can collect DRM usage stats using fdinfo in /proc. On my Intel laptop, it shows like below: $ perf list drm ... drm: drm-active-stolen-system0 [Total memory active in one or more engines. Unit: drm_i915] drm-active-system0 [Total memory active in one or more engines. Unit: drm_i915] drm-engine-capacity-video [Engine capacity. Unit: drm_i915] drm-engine-copy [Utilization in ns. Unit: drm_i915] drm-engine-render [Utilization in ns. Unit: drm_i915] drm-engine-video [Utilization in ns. Unit: drm_i915] ... $ sudo perf stat -a -e drm-engine-render,drm-engine-video,drm-engine-capacity-video sleep 1 Performance counter stats for 'system wide': 48,137,316,988,873 ns drm-engine-render 34,452,696,746 ns drm-engine-video 20 capacity drm-engine-capacity-video 1.002086194 seconds time elapsed perf list * Add description for software events. The description is in JSON format and the event parser now can handle the software events like others (for example, it's case-insensitive and subject to wildcard matching). $ perf list software List of pre-defined events (to be used in -e or -M): software: alignment-faults [Number of kernel handled memory alignment faults. Unit: software] bpf-output [An event used by BPF programs to write to the perf ring buffer. Unit: software] cgroup-switches [Number of context switches to a task in a different cgroup. Unit: software] context-switches [Number of context switches [This event is an alias of cs]. Unit: software] cpu-clock [Per-CPU high-resolution timer based event. Unit: software] cpu-migrations [Number of times a process has migrated to a new CPU [This event is an alias of migrations]. Unit: software] cs [Number of context switches [This event is an alias of context-switches]. Unit: software] dummy [A placeholder event that doesn't count anything. Unit: software] emulation-faults [Number of kernel handled unimplemented instruction faults handled through emulation. Unit: software] faults [Number of page faults [This event is an alias of page-faults]. Unit: software] major-faults [Number of major page faults. Major faults require I/O to handle. Unit: software] migrations [Number of times a process has migrated to a new CPU [This event is an alias of cpu-migrations]. Unit: software] minor-faults [Number of minor page faults. Minor faults don't require I/O to handle. Unit: software] page-faults [Number of page faults [This event is an alias of faults]. Unit: software] task-clock [Per-task high-resolution timer based event. Unit: software] perf ftrace: * Add -e/--events option to perf ftrace latency to measure latency between the two events instead of a function. $ sudo perf ftrace latency -ab -e i915_request_wait_begin,i915_request_wait_end --hide-empty -- sleep 1 # DURATION | COUNT | GRAPH | 256 - 512 us | 4 | ###### | 2 - 4 ms | 2 | ### | 4 - 8 ms | 12 | ################### | 8 - 16 ms | 10 | ################ | # statistics (in usec) total time: 194915 avg time: 6961 max time: 12855 min time: 373 count: 28 * Add new function graph tracer options (--graph-opts) to display more info like arguments and return value. They will be passed to the kernel ftrace directly. $ sudo perf ftrace -G vfs_write --graph-opts retval,retaddr # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | ... 5) | mutex_unlock() { /* <-rb_simple_write+0xda/0x150 */ 5) 0.188 us | local_clock(); /* <-lock_release+0x2ad/0x440 ret=0x3bf2a3cf90e */ 5) | rt_mutex_slowunlock() { /* <-rb_simple_write+0xda/0x150 */ 5) | _raw_spin_lock_irqsave() { /* <-rt_mutex_slowunlock+0x4f/0x200 */ 5) 0.123 us | preempt_count_add(); /* <-_raw_spin_lock_irqsave+0x23/0x90 ret=0x0 */ 5) 0.128 us | local_clock(); /* <-__lock_acquire.isra.0+0x17a/0x740 ret=0x3bf2a3cfc8b */ 5) 0.086 us | do_raw_spin_trylock(); /* <-_raw_spin_lock_irqsave+0x4a/0x90 ret=0x1 */ 5) 0.845 us | } /* _raw_spin_lock_irqsave ret=0x292 */ ... Misc: * Add perf archive --exclude-buildids <FILE> option to skip some binaries. The format of the FILE should be same as an output of perf buildid-list. * Get rid of dependency of libcrypto. It was just to get SHA-1 hash so implement it directly like in the kernel. A side effect is that it needs -fno-strict-aliasing compiler option (again, like in the kernel). * Convert all shell script tests to use bash" * tag 'perf-tools-for-v6.17-2025-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (179 commits) perf record: Cache build-ID of hit DSOs only perf test: Ensure lock contention using pipe mode perf python: Stop using deprecated PyUnicode_AsString() perf list: Skip ABI PMUs when printing pmu values perf list: Remove tracepoint printing code perf tp_pmu: Add event APIs perf tp_pmu: Factor existing tracepoint logic to new file perf parse-events: Remove non-json software events perf jevents: Add common software event json perf tools: Remove libtraceevent in .gitignore perf test: Fix comment ordering perf sort: Use perf_env to set arch sort keys and header perf test: Move PERF_SAMPLE_WEIGHT_STRUCT parsing to common test perf sample: Remove arch notion of sample parsing perf env: Remove global perf_env perf trace: Avoid global perf_env with evsel__env perf auxtrace: Pass perf_env from session through to mmap read perf machine: Explicitly pass in host perf_env perf bench synthesize: Avoid use of global perf_env perf top: Make perf_env locally scoped ...
2025-08-01Merge tag 'parisc-for-6.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc updates from Helge Deller: - The parisc kernel wrongly allows reading from read-protected userspace memory without faulting, e.g. when userspace uses mprotect() to read-protect a memory area and then uses a pointer to this memory in a write(2, addr, 1) syscall. To fix this issue, Dave Anglin developed a set of patches which use the proberi assembler instruction to additionally check read access permissions at runtime. - Randy Dunlap contributed two patches to fix a minor typo and to explain why a 32-bit compiler is needed although a 64-bit kernel is built * tag 'parisc-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Revise __get_user() to probe user read access parisc: Revise gateway LWS calls to probe user read access parisc: Drop WARN_ON_ONCE() from flush_cache_vmap parisc: Try to fixup kernel exception in bad_area_nosemaphore path of do_page_fault() parisc: Define and use set_pte_at() parisc: Rename pte_needs_flush() to pte_needs_cache_flush() in cache.c parisc: Check region is readable by user in raw_copy_from_user() parisc: Update comments in make_insert_tlb parisc: Makefile: explain that 64BIT requires both 32-bit and 64-bit compilers parisc: Makefile: fix a typo in palo.conf
2025-08-01tracing: Have unsigned int function args displayed as hexadecimalSteven Rostedt
Most function arguments that are passed in as unsigned int or unsigned long are better displayed as hexadecimal than normal integer. For example, the functions: static void __create_object(unsigned long ptr, size_t size, int min_count, gfp_t gfp, unsigned int objflags); static bool stack_access_ok(struct unwind_state *state, unsigned long _addr, size_t len); void __local_bh_disable_ip(unsigned long ip, unsigned int cnt); Show up in the trace as: __create_object(ptr=-131387050520576, size=4096, min_count=1, gfp=3264, objflags=0) <-kmem_cache_alloc_noprof stack_access_ok(state=0xffffc9000233fc98, _addr=-60473102566256, len=8) <-unwind_next_frame __local_bh_disable_ip(ip=-2127311112, cnt=256) <-handle_softirqs Instead, by displaying unsigned as hexadecimal, they look more like this: __create_object(ptr=0xffff8881028d2080, size=0x280, min_count=1, gfp=0x82820, objflags=0x0) <-kmem_cache_alloc_node_noprof stack_access_ok(state=0xffffc90000003938, _addr=0xffffc90000003930, len=0x8) <-unwind_next_frame __local_bh_disable_ip(ip=0xffffffff8133cef8, cnt=0x100) <-handle_softirqs Which is much easier to understand as most unsigned longs are usually just pointers. Even the "unsigned int cnt" in __local_bh_disable_ip() looks better as hexadecimal as a lot of flags are passed as unsigned. Changes since v2: https://lore.kernel.org/20250801111453.01502861@gandalf.local.home - Use btf_int_encoding() instead of open coding it (Martin KaFai Lau) Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Link: https://lore.kernel.org/20250801165601.7770d65c@gandalf.local.home Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-01Merge tag 'cxl-for-6.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull CXL updates from Dave Jiang: "The most significant changes in this pull request is the series that introduces ACQUIRE() and ACQUIRE_ERR() macros to replace conditional locking and ease the pain points of scoped_cond_guard(). The series also includes follow on changes that refactor the CXL sub-system to utilize the new macros. Detail summary: - Add documentation template for CXL conventions to document CXL platform quirks - Replace mutex_lock_io() with mutex_lock() for mailbox - Add location limit for fake CFMWS range for cxl_test, ARM platform enabling - CXL documentation typo and clarity fixes - Use correct format specifier for function cxl_set_ecs_threshold() - Make cxl_bus_type constant - Introduce new helper cxl_resource_contains_addr() to check address availability - Fix wrong DPA checking for PPR operation - Remove core/acpi.c and CXL core dependency on ACPI - Introduce ACQUIRE() and ACQUIRE_ERR() for conditional locks - Add CXL updates utilizing ACQUIRE() macro to remove gotos and improve readability - Add return for the dummy version of cxl_decoder_detach() without CONFIG_CXL_REGION - CXL events updates for spec r3.2 - Fix return of __cxl_decoder_detach() error path - CXL debugfs documentation fix" * tag 'cxl-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (28 commits) Documentation/ABI/testing/debugfs-cxl: Add 'cxl' to clear_poison path cxl/region: Fix an ERR_PTR() vs NULL bug cxl/events: Trace Memory Sparing Event Record cxl/events: Add extra validity checks for CVME count in DRAM Event Record cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record cxl/events: Update Common Event Record to CXL spec rev 3.2 cxl: Fix -Werror=return-type in cxl_decoder_detach() cleanup: Fix documentation build error for ACQUIRE updates cxl: Convert to ACQUIRE() for conditional rwsem locking cxl/region: Consolidate cxl_decoder_kill_region() and cxl_region_detach() cxl/region: Move ready-to-probe state check to a helper cxl/region: Split commit_store() into __commit() and queue_reset() helpers cxl/decoder: Drop pointless locking cxl/decoder: Move decoder register programming to a helper cxl/mbox: Convert poison list mutex to ACQUIRE() cleanup: Introduce ACQUIRE() and ACQUIRE_ERR() for conditional locks cxl: Remove core/acpi.c and cxl core dependency on ACPI cxl/core: Using cxl_resource_contains_addr() to check address availability cxl/edac: Fix wrong dpa checking for PPR operation cxl/core: Introduce a new helper cxl_resource_contains_addr() ...
2025-08-01net: Add locking to protect skb->dev access in ip_outputSharath Chandra Vurukala
In ip_output() skb->dev is updated from the skb_dst(skb)->dev this can become invalid when the interface is unregistered and freed, Introduced new skb_dst_dev_rcu() function to be used instead of skb_dst_dev() within rcu_locks in ip_output.This will ensure that all the skb's associated with the dev being deregistered will be transnmitted out first, before freeing the dev. Given that ip_output() is called within an rcu_read_lock() critical section or from a bottom-half context, it is safe to introduce an RCU read-side critical section within it. Multiple panic call stacks were observed when UL traffic was run in concurrency with device deregistration from different functions, pasting one sample for reference. [496733.627565][T13385] Call trace: [496733.627570][T13385] bpf_prog_ce7c9180c3b128ea_cgroupskb_egres+0x24c/0x7f0 [496733.627581][T13385] __cgroup_bpf_run_filter_skb+0x128/0x498 [496733.627595][T13385] ip_finish_output+0xa4/0xf4 [496733.627605][T13385] ip_output+0x100/0x1a0 [496733.627613][T13385] ip_send_skb+0x68/0x100 [496733.627618][T13385] udp_send_skb+0x1c4/0x384 [496733.627625][T13385] udp_sendmsg+0x7b0/0x898 [496733.627631][T13385] inet_sendmsg+0x5c/0x7c [496733.627639][T13385] __sys_sendto+0x174/0x1e4 [496733.627647][T13385] __arm64_sys_sendto+0x28/0x3c [496733.627653][T13385] invoke_syscall+0x58/0x11c [496733.627662][T13385] el0_svc_common+0x88/0xf4 [496733.627669][T13385] do_el0_svc+0x2c/0xb0 [496733.627676][T13385] el0_svc+0x2c/0xa4 [496733.627683][T13385] el0t_64_sync_handler+0x68/0xb4 [496733.627689][T13385] el0t_64_sync+0x1a4/0x1a8 Changes in v3: - Replaced WARN_ON() with WARN_ON_ONCE(), as suggested by Willem de Bruijn. - Dropped legacy lines mistakenly pulled in from an outdated branch. Changes in v2: - Addressed review comments from Eric Dumazet - Used READ_ONCE() to prevent potential load/store tearing - Added skb_dst_dev_rcu() and used along with rcu_read_lock() in ip_output Signed-off-by: Sharath Chandra Vurukala <quic_sharathv@quicinc.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250730105118.GA26100@hu-sharathv-hyd.qualcomm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01net/sched: taprio: enforce minimum value for picos_per_byteTakamitsu Iwai
Syzbot reported a WARNING in taprio_get_start_time(). When link speed is 470,589 or greater, q->picos_per_byte becomes too small, causing length_to_duration(q, ETH_ZLEN) to return zero. This zero value leads to validation failures in fill_sched_entry() and parse_taprio_schedule(), allowing arbitrary values to be assigned to entry->interval and cycle_time. As a result, sched->cycle can become zero. Since SPEED_800000 is the largest defined speed in include/uapi/linux/ethtool.h, this issue can occur in realistic scenarios. To ensure length_to_duration() returns a non-zero value for minimum-sized Ethernet frames (ETH_ZLEN = 60), picos_per_byte must be at least 17 (60 * 17 > PSEC_PER_NSEC which is 1000). This patch enforces a minimum value of 17 for picos_per_byte when the calculated value would be lower, and adds a warning message to inform users that scheduling accuracy may be affected at very high link speeds. Fixes: fb66df20a720 ("net/sched: taprio: extend minimum interval restriction to entire cycle too") Reported-by: syzbot+398e1ee4ca2cac05fddb@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=398e1ee4ca2cac05fddb Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp> Link: https://patch.msgid.link/20250728173149.45585-1-takamitz@amazon.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01net: drop UFO packets in udp_rcv_segment()Wang Liang
When sending a packet with virtio_net_hdr to tun device, if the gso_type in virtio_net_hdr is SKB_GSO_UDP and the gso_size is less than udphdr size, below crash may happen. ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:4572! Oops: invalid opcode: 0000 [#1] SMP NOPTI CPU: 0 UID: 0 PID: 62 Comm: mytest Not tainted 6.16.0-rc7 #203 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:skb_pull_rcsum+0x8e/0xa0 Code: 00 00 5b c3 cc cc cc cc 8b 93 88 00 00 00 f7 da e8 37 44 38 00 f7 d8 89 83 88 00 00 00 48 8b 83 c8 00 00 00 5b c3 cc cc cc cc <0f> 0b 0f 0b 66 66 2e 0f 1f 84 00 000 RSP: 0018:ffffc900001fba38 EFLAGS: 00000297 RAX: 0000000000000004 RBX: ffff8880040c1000 RCX: ffffc900001fb948 RDX: ffff888003e6d700 RSI: 0000000000000008 RDI: ffff88800411a062 RBP: ffff8880040c1000 R08: 0000000000000000 R09: 0000000000000001 R10: ffff888003606c00 R11: 0000000000000001 R12: 0000000000000000 R13: ffff888004060900 R14: ffff888004050000 R15: ffff888004060900 FS: 000000002406d3c0(0000) GS:ffff888084a19000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000040 CR3: 0000000004007000 CR4: 00000000000006f0 Call Trace: <TASK> udp_queue_rcv_one_skb+0x176/0x4b0 net/ipv4/udp.c:2445 udp_queue_rcv_skb+0x155/0x1f0 net/ipv4/udp.c:2475 udp_unicast_rcv_skb+0x71/0x90 net/ipv4/udp.c:2626 __udp4_lib_rcv+0x433/0xb00 net/ipv4/udp.c:2690 ip_protocol_deliver_rcu+0xa6/0x160 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x72/0x90 net/ipv4/ip_input.c:233 ip_sublist_rcv_finish+0x5f/0x70 net/ipv4/ip_input.c:579 ip_sublist_rcv+0x122/0x1b0 net/ipv4/ip_input.c:636 ip_list_rcv+0xf7/0x130 net/ipv4/ip_input.c:670 __netif_receive_skb_list_core+0x21d/0x240 net/core/dev.c:6067 netif_receive_skb_list_internal+0x186/0x2b0 net/core/dev.c:6210 napi_complete_done+0x78/0x180 net/core/dev.c:6580 tun_get_user+0xa63/0x1120 drivers/net/tun.c:1909 tun_chr_write_iter+0x65/0xb0 drivers/net/tun.c:1984 vfs_write+0x300/0x420 fs/read_write.c:593 ksys_write+0x60/0xd0 fs/read_write.c:686 do_syscall_64+0x50/0x1c0 arch/x86/entry/syscall_64.c:63 </TASK> To trigger gso segment in udp_queue_rcv_skb(), we should also set option UDP_ENCAP_ESPINUDP to enable udp_sk(sk)->encap_rcv. When the encap_rcv hook return 1 in udp_queue_rcv_one_skb(), udp_csum_pull_header() will try to pull udphdr, but the skb size has been segmented to gso size, which leads to this crash. Previous commit cf329aa42b66 ("udp: cope with UDP GRO packet misdirection") introduces segmentation in UDP receive path only for GRO, which was never intended to be used for UFO, so drop UFO packets in udp_rcv_segment(). Link: https://lore.kernel.org/netdev/20250724083005.3918375-1-wangliang74@huawei.com/ Link: https://lore.kernel.org/netdev/20250729123907.3318425-1-wangliang74@huawei.com/ Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection") Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: Wang Liang <wangliang74@huawei.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250730101458.3470788-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01Merge tag 'rproc-v6.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux Pull remoteproc updates from Bjorn Andersson: - Make the Xilinx remoteproc driver support running on only a single core, disable still unsupported remoteproc features, and stop the remoteproc on shutdown to facilitate kexec. - Conclude the renaming of the Qualcomm ADSP driver to "PAS" that was started many years ago. * tag 'rproc-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux: remoteproc: xlnx: Fix kernel-doc warnings remoteproc: xlnx: Disable unsupported features remoteproc: xlnx: Add shutdown callback remoteproc: xlnx: Allow single core use in split mode dt-bindings: remoteproc: qcom,sa8775p-pas: Correct the interrupt number remoteproc: Don't use %pK through printk dt-bindings: remoteproc: qcom,sm8150-pas: Document QCS615 remoteproc remoteproc: qcom: pas: Conclude the rename from adsp
2025-08-01selftests/bpf: Test for unaligned flow_dissector ctx accessPaul Chaignon
This patch adds tests for two context fields where unaligned accesses were not properly rejected. Note the new macro is similar to the existing narrow_load macro, but we need a different description and access offset. Combining the two macros into one is probably doable but I don't think it would help readability. vmlinux.h is included in place of bpf.h so we have the definition of struct bpf_nf_ctx. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/bf014046ddcf41677fb8b98d150c14027e9fddba.1754039605.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-01net: mdio: mdio-bcm-unimac: Correct rate fallback logicFlorian Fainelli
When the parent clock is a gated clock which has multiple parents, the clock provider (clk-scmi typically) might return a rate of 0 since there is not one of those particular parent clocks that should be chosen for returning a rate. Prior to ee975351cf0c ("net: mdio: mdio-bcm-unimac: Manage clock around I/O accesses"), we would not always be passing a clock reference depending upon how mdio-bcm-unimac was instantiated. In that case, we would take the fallback path where the rate is hard coded to 250MHz. Make sure that we still fallback to using a fixed rate for the divider calculation, otherwise we simply ignore the desired MDIO bus clock frequency which can prevent us from interfacing with Ethernet PHYs properly. Fixes: ee975351cf0c ("net: mdio: mdio-bcm-unimac: Manage clock around I/O accesses") Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250730202533.3463529-1-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01ipv6: reject malicious packets in ipv6_gso_segment()Eric Dumazet
syzbot was able to craft a packet with very long IPv6 extension headers leading to an overflow of skb->transport_header. This 16bit field has a limited range. Add skb_reset_transport_header_careful() helper and use it from ipv6_gso_segment() WARNING: CPU: 0 PID: 5871 at ./include/linux/skbuff.h:3032 skb_reset_transport_header include/linux/skbuff.h:3032 [inline] WARNING: CPU: 0 PID: 5871 at ./include/linux/skbuff.h:3032 ipv6_gso_segment+0x15e2/0x21e0 net/ipv6/ip6_offload.c:151 Modules linked in: CPU: 0 UID: 0 PID: 5871 Comm: syz-executor211 Not tainted 6.16.0-rc6-syzkaller-g7abc678e3084 #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025 RIP: 0010:skb_reset_transport_header include/linux/skbuff.h:3032 [inline] RIP: 0010:ipv6_gso_segment+0x15e2/0x21e0 net/ipv6/ip6_offload.c:151 Call Trace: <TASK> skb_mac_gso_segment+0x31c/0x640 net/core/gso.c:53 nsh_gso_segment+0x54a/0xe10 net/nsh/nsh.c:110 skb_mac_gso_segment+0x31c/0x640 net/core/gso.c:53 __skb_gso_segment+0x342/0x510 net/core/gso.c:124 skb_gso_segment include/net/gso.h:83 [inline] validate_xmit_skb+0x857/0x11b0 net/core/dev.c:3950 validate_xmit_skb_list+0x84/0x120 net/core/dev.c:4000 sch_direct_xmit+0xd3/0x4b0 net/sched/sch_generic.c:329 __dev_xmit_skb net/core/dev.c:4102 [inline] __dev_queue_xmit+0x17b6/0x3a70 net/core/dev.c:4679 Fixes: d1da932ed4ec ("ipv6: Separate ipv6 offload support") Reported-by: syzbot+af43e647fd835acc02df@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/688a1a05.050a0220.5d226.0008.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250730131738.3385939-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01selftests: avoid using ifconfigEric Dumazet
ifconfig is deprecated and not always present, use ip command instead. Fixes: e0f3b3e5c77a ("selftests: Add test cases for vlan_filter modification during runtime") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dong Chenchen <dongchenchen2@huawei.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250730115313.3356036-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01dpll: Make ZL3073X invisibleGeert Uytterhoeven
Currently, the user is always asked about the Microchip Azurite DPLL/PTP/SyncE core driver, even when I2C and SPI are disabled, and thus the driver cannot be used at all. Fix this by making the Kconfig symbol for the core driver invisible (unless compile-testing), and selecting it by the bus glue sub-drivers. Drop the modular defaults, as drivers should not default to enabled. Fixes: 2df8e64e01c10a4b ("dpll: Add basic Microchip ZL3073x support") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://patch.msgid.link/97804163aeb262f0e0706d00c29d9bb751844454.1753874405.git.geert+renesas@glider.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01net/mlx5: Correctly set gso_segs when LRO is usedChristoph Paasch
When gso_segs is left at 0, a number of assumptions will end up being incorrect throughout the stack. For example, in the GRO-path, we set NAPI_GRO_CB()->count to gso_segs. So, if a non-LRO'ed packet followed by an LRO'ed packet is being processed in GRO, the first one will have NAPI_GRO_CB()->count set to 1 and the next one to 0 (in dev_gro_receive()). Since commit 531d0d32de3e ("net/mlx5: Correctly set gso_size when LRO is used") these packets will get merged (as their gso_size now matches). So, we end up in gro_complete() with NAPI_GRO_CB()->count == 1 and thus don't call inet_gro_complete(). Meaning, checksum-validation in tcp_checksum_complete() will fail with a "hw csum failure". Even before the above mentioned commit, incorrect gso_segs means that other things like TCP's accounting of incoming packets (tp->segs_in, data_segs_in, rcv_ooopack) will be incorrect. Which means that if one does bytes_received/data_segs_in, the result will be bigger than the MTU. Fix this by initializing gso_segs correctly when LRO is used. Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files") Reported-by: Gal Pressman <gal@nvidia.com> Closes: https://lore.kernel.org/netdev/6583783f-f0fb-4fb1-a415-feec8155bc69@nvidia.com/ Signed-off-by: Christoph Paasch <cpaasch@openai.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250729-mlx5_gso_segs-v1-1-b48c480c1c12@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: - vhost can now support legacy threading if enabled in Kconfig - vsock memory allocation strategies for large buffers have been improved, reducing pressure on kmalloc - vhost now supports the in-order feature. guest bits missed the merge window. - fixes, cleanups all over the place * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (30 commits) vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers vsock/virtio: Rename virtio_vsock_skb_rx_put() vhost/vsock: Allocate nonlinear SKBs for handling large receive buffers vsock/virtio: Move SKB allocation lower-bound check to callers vsock/virtio: Rename virtio_vsock_alloc_skb() vsock/virtio: Resize receive buffers so that each SKB fits in a 4K page vsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put() vsock/virtio: Validate length in packet header before skb_put() vhost/vsock: Avoid allocating arbitrarily-sized SKBs vhost_net: basic in_order support vhost: basic in order support vhost: fail early when __vhost_add_used() fails vhost: Reintroduce kthread API and add mode selection vdpa: Fix IDR memory leak in VDUSE module exit vdpa/mlx5: Fix release of uninitialized resources on error path vhost-scsi: Fix check for inline_sg_cnt exceeding preallocated limit virtio: virtio_dma_buf: fix missing parameter documentation vhost: Fix typos vhost: vringh: Remove unused functions vhost: vringh: Remove unused iotlb functions ...
2025-08-01sfc: unfix not-a-typo in commentEdward Cree
Commit fe09560f8241 ("net: Fix typos") removed duplicated word 'fallback', but this was not a typo and change altered the semantic meaning of the comment. Partially revert, using the phrase 'fallback of the fallback' to make the meaning more clear to future readers so that they won't try to change it again. Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250731144138.2637949-1-edward.cree@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01Merge tag 'pci-v6.17-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull PCI updates from Bjorn Helgaas: "Enumeration: - Allow built-in drivers, not just modular drivers, to use async initial probing (Lukas Wunner) - Support Immediate Readiness even on devices with no PM Capability (Sean Christopherson) - Consolidate definition of PCIE_RESET_CONFIG_WAIT_MS (100ms), the required delay between a reset and sending config requests to a device (Niklas Cassel) - Add pci_is_display() to check for "Display" base class and use it in ALSA hda, vfio, vga_switcheroo, vt-d (Mario Limonciello) - Allow 'isolated PCI functions' (multi-function devices without a function 0) for LoongArch, similar to s390 and jailhouse (Huacai Chen) Power control: - Add ability to enable optional slot clock for cases where the PCIe host controller and the slot are supplied by different clocks (Marek Vasut) PCIe native device hotplug: - Fix runtime PM ref imbalance on Hot-Plug Capable ports caused by misinterpreting a config read failure after a device has been removed (Lukas Wunner) - Avoid creating a useless PCIe port service device for pciehp if the slot is handled by the ACPI hotplug driver (Lukas Wunner) - Ignore ACPI hotplug slots when calculating depth of pciehp hotplug ports (Lukas Wunner) Virtualization: - Save VF resizable BAR state and restore it after reset (Michał Winiarski) - Allow IOV resources (VF BARs) to be resized (Michał Winiarski) - Add pci_iov_vf_bar_set_size() so drivers can control VF BAR size (Michał Winiarski) Endpoint framework: - Add RC-to-EP doorbell support using platform MSI controller, including a test case (Frank Li) - Allow BAR assignment via configfs so platforms have flexibility in determining BAR usage (Jerome Brunet) Native PCIe controller drivers: - Convert amazon,al-alpine-v[23]-pcie, apm,xgene-pcie, axis,artpec6-pcie, marvell,armada-3700-pcie, st,spear1340-pcie to DT schema format (Rob Herring) - Use dev_fwnode() instead of of_fwnode_handle() to remove OF dependency in altera (fixes an unused variable), designware-host, mediatek, mediatek-gen3, mobiveil, plda, xilinx, xilinx-dma, xilinx-nwl (Jiri Slaby, Arnd Bergmann) - Convert aardvark, altera, brcmstb, designware-host, iproc, mediatek, mediatek-gen3, mobiveil, plda, rcar-host, vmd, xilinx, xilinx-dma, xilinx-nwl from using pci_msi_create_irq_domain() to using msi_create_parent_irq_domain() instead; this makes the interrupt controller per-PCI device, allows dynamic allocation of vectors after initialization, and allows support of IMS (Nam Cao) APM X-Gene PCIe controller driver: - Rewrite MSI handling to MSI CPU affinity, drop useless CPU hotplug bits, use device-managed memory allocations, and clean things up (Marc Zyngier) - Probe xgene-msi as a standard platform driver rather than a subsys_initcall (Marc Zyngier) Broadcom STB PCIe controller driver: - Add optional DT 'num-lanes' property and if present, use it to override the Maximum Link Width advertised in Link Capabilities (Jim Quinlan) Cadence PCIe controller driver: - Use PCIe Message routing types from the PCI core rather than defining private ones (Hans Zhang) Freescale i.MX6 PCIe controller driver: - Add IMX8MQ_EP third 64-bit BAR in epc_features (Richard Zhu) - Add IMX8MM_EP and IMX8MP_EP fixed 256-byte BAR 4 in epc_features (Richard Zhu) - Configure LUT for MSI/IOMMU in Endpoint mode so Root Complex can trigger doorbel on Endpoint (Frank Li) - Remove apps_reset (LTSSM_EN) from imx_pcie_{assert,deassert}_core_reset(), which fixes a hotplug regression on i.MX8MM (Richard Zhu) - Delay Endpoint link start until configfs 'start' written (Richard Zhu) Intel VMD host bridge driver: - Add Intel Panther Lake (PTL)-H/P/U Vendor ID (George D Sworo) Qualcomm PCIe controller driver: - Add DT binding and driver support for SA8255p, which supports ECAM for Configuration Space access (Mayank Rana) - Update DT binding and driver to describe PHYs and per-Root Port resets in a Root Port stanza and deprecate describing them in the host bridge; this makes it possible to support multiple Root Ports in the future (Krishna Chaitanya Chundru) - Add Qualcomm QCS615 to SM8150 DT binding (Ziyue Zhang) - Add Qualcomm QCS8300 to SA8775p DT binding (Ziyue Zhang) - Drop TBU and ref clocks from Qualcomm SM8150 and SC8180x DT bindings (Konrad Dybcio) - Document 'link_down' reset in Qualcomm SA8775P DT binding (Ziyue Zhang) - Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ (Niklas Cassel) Rockchip PCIe controller driver: - Drop unused PCIe Message routing and code definitions (Hans Zhang) - Remove several unused header includes (Hans Zhang) - Use standard PCIe config register definitions instead of rockchip-specific redefinitions (Geraldo Nascimento) - Set Target Link Speed to 5.0 GT/s before retraining so we have a chance to train at a higher speed (Geraldo Nascimento) Rockchip DesignWare PCIe controller driver: - Prevent race between link training and register update via DBI by inhibiting link training after hot reset and link down (Wilfred Mallawa) - Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ (Niklas Cassel) Sophgo PCIe controller driver: - Add DT binding and driver for Sophgo SG2044 PCIe controller driver in Root Complex mode (Inochi Amaoto) Synopsys DesignWare PCIe controller driver: - Add required PCIE_RESET_CONFIG_WAIT_MS after waiting for Link up on Ports that support > 5.0 GT/s. Slower Ports still rely on the not-quite-correct PCIE_LINK_WAIT_SLEEP_MS 90ms default delay while waiting for the Link (Niklas Cassel)" * tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (116 commits) dt-bindings: PCI: qcom,pcie-sa8775p: Document 'link_down' reset dt-bindings: PCI: Remove 83xx-512x-pci.txt dt-bindings: PCI: Convert amazon,al-alpine-v[23]-pcie to DT schema dt-bindings: PCI: Convert marvell,armada-3700-pcie to DT schema dt-bindings: PCI: Convert apm,xgene-pcie to DT schema dt-bindings: PCI: Convert axis,artpec6-pcie to DT schema dt-bindings: PCI: Convert st,spear1340-pcie to DT schema PCI: Move is_pciehp check out of pciehp_is_native() PCI: pciehp: Use is_pciehp instead of is_hotplug_bridge PCI/portdrv: Use is_pciehp instead of is_hotplug_bridge PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports selftests: pci_endpoint: Add doorbell test case misc: pci_endpoint_test: Add doorbell test case PCI: endpoint: pci-epf-test: Add doorbell test support PCI: endpoint: Add pci_epf_align_inbound_addr() helper for inbound address alignment PCI: endpoint: pci-ep-msi: Add checks for MSI parent and mutability PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller PCI: dwc: Add Sophgo SG2044 PCIe controller driver in Root Complex mode PCI: vmd: Switch to msi_create_parent_irq_domain() PCI: vmd: Convert to lock guards ...
2025-08-01net: airoha: Fix PPE table access in airoha_ppe_debugfs_foe_show()Lorenzo Bianconi
In order to avoid any possible race we need to hold the ppe_lock spinlock accessing the hw PPE table. airoha_ppe_foe_get_entry routine is always executed holding ppe_lock except in airoha_ppe_debugfs_foe_show routine. Fix the problem introducing airoha_ppe_foe_get_entry_locked routine. Fixes: 3fe15c640f380 ("net: airoha: Introduce PPE debugfs support") Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20250731-airoha_ppe_foe_get_entry_locked-v2-1-50efbd8c0fd6@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01selftests: net: Fix flaky neighbor garbage collection testIdo Schimmel
The purpose of the "Periodic garbage collection" test case is to make sure that "extern_valid" neighbors are not flushed during periodic garbage collection, unlike regular neighbor entries. The test case is currently doing the following: 1. Changing the base reachable time to 10 seconds so that periodic garbage collection will run every 5 seconds. 2. Changing the garbage collection stale time to 5 seconds so that neighbors that have not been used in the last 5 seconds will be considered for removal. 3. Waiting for the base reachable time change to take effect. 4. Adding an "extern_valid" neighbor, a non-"extern_valid" neighbor and a bunch of other neighbors so that the threshold ("thresh1") will be crossed and stale neighbors will be flushed during garbage collection. 5. Waiting for 10 seconds to give garbage collection a chance to run. 6. Checking that the "extern_valid" neighbor was not flushed and that the non-"extern_valid" neighbor was flushed. The test sometimes fails in the netdev CI because the non-"extern_valid" neighbor was not flushed. I am unable to reproduce this locally, but my theory that since we do not know exactly when the periodic garbage collection runs, it is possible for it to run at a time when the non-"extern_valid" neighbor is still not considered stale. Fix by moving the addition of the two neighbors before step 3 and by reducing the garbage collection stale time to 1 second, to ensure that both neighbors are considered stale when garbage collection runs. Fixes: 171f2ee31a42 ("selftests: net: Add a selftest for externally validated neighbor entries") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20250728093504.4ebbd73c@kernel.org/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250731110914.506890-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-01ring-buffer: Convert ring_buffer_write() to use guard(preempt_notrace)Steven Rostedt
The function ring_buffer_write() has a goto out to only do a preempt_enable_notrace(). This can be replaced by a guard. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20250801203858.205479143@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>