summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-01-25selftests/mm: Update va_high_addr_switch.sh to check CPU for la57 flagAudra Mitchell
In order for the page table level 5 to be in use, the CPU must have the setting enabled in addition to the CONFIG option. Check for the flag to be set to avoid false test failures on systems that do not have this cpu flag set. The test does a series of mmap calls including three using the MAP_FIXED flag and specifying an address that is 1<<47 or 1<<48. These addresses are only available if you are using level 5 page tables, which requires both the CPU to have the capabiltiy (la57 flag) and the kernel to be configured. Currently the test only checks for the kernel configuration option, so this test can still report a false positive. Here are the three failing lines: $ ./va_high_addr_switch | grep FAILED mmap(ADDR_SWITCH_HINT, 2 * PAGE_SIZE, MAP_FIXED): 0xffffffffffffffff - FAILED mmap(HIGH_ADDR, MAP_FIXED): 0xffffffffffffffff - FAILED mmap(ADDR_SWITCH_HINT, 2 * PAGE_SIZE, MAP_FIXED): 0xffffffffffffffff - FAILED I thought (for about a second) refactoring the test so that these three mmap calls will only be run on systems with the level 5 page tables available, but the whole point of the test is to check the level 5 feature... Link: https://lkml.kernel.org/r/20240119205801.62769-1-audra@redhat.com Fixes: 4f2930c6718a ("selftests/vm: only run 128TBswitch with 5-level paging") Signed-off-by: Audra Mitchell <audra@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Adam Sindelar <adam@wowsignal.io> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25selftests: mm: fix map_hugetlb failure on 64K page size systemsNico Pache
On systems with 64k page size and 512M huge page sizes, the allocation and test succeeds but errors out at the munmap. As the comment states, munmap will failure if its not HUGEPAGE aligned. This is due to the length of the mapping being 1/2 the size of the hugepage causing the munmap to not be hugepage aligned. Fix this by making the mapping length the full hugepage if the hugepage is larger than the length of the mapping. Link: https://lkml.kernel.org/r/20240119131429.172448-1-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Cc: Donet Tom <donettom@linux.vnet.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25MAINTAINERS: supplement of zswap maintainers updateYosry Ahmed
As discussed on the mailing list [1], merge the zpool maintainers entry into the zswap one. Also, add CREDITS entries for previous zswap/zpool maintainers. [1] https://lore.kernel.org/linux-mm/CAJD7tkYx4YWhGoVwnSeGc8dY_1aRRxxg8PzWBV==A6iqG_OgFw@mail.gmail.com/ Link: https://lkml.kernel.org/r/20240117182152.1439822-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Seth Jennings <sjenning@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25stackdepot: make fast paths lock-less againMarco Elver
With the introduction of the pool_rwlock (reader-writer lock), several fast paths end up taking the pool_rwlock as readers. Furthermore, stack_depot_put() unconditionally takes the pool_rwlock as a writer. Despite allowing readers to make forward-progress concurrently, reader-writer locks have inherent cache contention issues, which does not scale well on systems with large CPU counts. Rework the synchronization story of stack depot to again avoid taking any locks in the fast paths. This is done by relying on RCU-protected list traversal, and the NMI-safe subset of RCU to delay reuse of freed stack records. See code comments for more details. Along with the performance issues, this also fixes incorrect nesting of rwlock within a raw_spinlock, given that stack depot should still be usable from anywhere: | [ BUG: Invalid wait context ] | ----------------------------- | swapper/0/1 is trying to lock: | ffffffff89869be8 (pool_rwlock){..--}-{3:3}, at: stack_depot_save_flags | other info that might help us debug this: | context-{5:5} | 2 locks held by swapper/0/1: | #0: ffffffff89632440 (rcu_read_lock){....}-{1:3}, at: __queue_work | #1: ffff888100092018 (&pool->lock){-.-.}-{2:2}, at: __queue_work <-- raw_spin_lock Stack depot usage stats are similar to the previous version after a KASAN kernel boot: $ cat /sys/kernel/debug/stackdepot/stats pools: 838 allocations: 29865 frees: 6604 in_use: 23261 freelist_size: 1879 The number of pools is the same as previously. The freelist size is minimally larger, but this may also be due to variance across system boots. This shows that even though we do not eagerly wait for the next RCU grace period (such as with synchronize_rcu() or call_rcu()) after freeing a stack record - requiring depot_pop_free() to "poll" if an entry may be used - new allocations are very likely to happen in later RCU grace periods. Link: https://lkml.kernel.org/r/20240118110216.2539519-2-elver@google.com Fixes: 108be8def46e ("lib/stackdepot: allow users to evict stack traces") Reported-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25stackdepot: add stats counters exported via debugfsMarco Elver
Add a few basic stats counters for stack depot that can be used to derive if stack depot is working as intended. This is a snapshot of the new stats after booting a system with a KASAN-enabled kernel: $ cat /sys/kernel/debug/stackdepot/stats pools: 838 allocations: 29861 frees: 6561 in_use: 23300 freelist_size: 1840 Generally, "pools" should be well below the max; once the system is booted, "in_use" should remain relatively steady. Link: https://lkml.kernel.org/r/20240118110216.2539519-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25mm, kmsan: fix infinite recursion due to RCU critical sectionMarco Elver
Alexander Potapenko writes in [1]: "For every memory access in the code instrumented by KMSAN we call kmsan_get_metadata() to obtain the metadata for the memory being accessed. For virtual memory the metadata pointers are stored in the corresponding `struct page`, therefore we need to call virt_to_page() to get them. According to the comment in arch/x86/include/asm/page.h, virt_to_page(kaddr) returns a valid pointer iff virt_addr_valid(kaddr) is true, so KMSAN needs to call virt_addr_valid() as well. To avoid recursion, kmsan_get_metadata() must not call instrumented code, therefore ./arch/x86/include/asm/kmsan.h forks parts of arch/x86/mm/physaddr.c to check whether a virtual address is valid or not. But the introduction of rcu_read_lock() to pfn_valid() added instrumented RCU API calls to virt_to_page_or_null(), which is called by kmsan_get_metadata(), so there is an infinite recursion now. I do not think it is correct to stop that recursion by doing kmsan_enter_runtime()/kmsan_exit_runtime() in kmsan_get_metadata(): that would prevent instrumented functions called from within the runtime from tracking the shadow values, which might introduce false positives." Fix the issue by switching pfn_valid() to the _sched() variant of rcu_read_lock/unlock(), which does not require calling into RCU. Given the critical section in pfn_valid() is very small, this is a reasonable trade-off (with preemptible RCU). KMSAN further needs to be careful to suppress calls into the scheduler, which would be another source of recursion. This can be done by wrapping the call to pfn_valid() into preempt_disable/enable_no_resched(). The downside is that this sacrifices breaking scheduling guarantees; however, a kernel compiled with KMSAN has already given up any performance guarantees due to being heavily instrumented. Note, KMSAN code already disables tracing via Makefile, and since mmzone.h is included, it is not necessary to use the notrace variant, which is generally preferred in all other cases. Link: https://lkml.kernel.org/r/20240115184430.2710652-1-glider@google.com [1] Link: https://lkml.kernel.org/r/20240118110022.2538350-1-elver@google.com Fixes: 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Alexander Potapenko <glider@google.com> Reported-by: syzbot+93a9e8a3dea8d6085e12@syzkaller.appspotmail.com Reviewed-by: Alexander Potapenko <glider@google.com> Tested-by: Alexander Potapenko <glider@google.com> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), againZach O'Keefe
(struct dirty_throttle_control *)->thresh is an unsigned long, but is passed as the u32 divisor argument to div_u64(). On architectures where unsigned long is 64 bytes, the argument will be implicitly truncated. Use div64_u64() instead of div_u64() so that the value used in the "is this a safe division" check is the same as the divisor. Also, remove redundant cast of the numerator to u64, as that should happen implicitly. This would be difficult to exploit in memcg domain, given the ratio-based arithmetic domain_drity_limits() uses, but is much easier in global writeback domain with a BDI_CAP_STRICTLIMIT-backing device, using e.g. vm.dirty_bytes=(1<<32)*PAGE_SIZE so that dtc->thresh == (1<<32) Link: https://lkml.kernel.org/r/20240118181954.1415197-1-zokeefe@google.com Fixes: f6789593d5ce ("mm/page-writeback.c: fix divide by zero in bdi_dirty_limits()") Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Maxim Patlasov <MPatlasov@parallels.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25selftests/mm: switch to bash from shMuhammad Usama Anjum
Running charge_reserved_hugetlb.sh generates errors if sh is set to dash: ./charge_reserved_hugetlb.sh: 9: [[: not found ./charge_reserved_hugetlb.sh: 19: [[: not found ./charge_reserved_hugetlb.sh: 27: [[: not found ./charge_reserved_hugetlb.sh: 37: [[: not found ./charge_reserved_hugetlb.sh: 45: Syntax error: "(" unexpected Switch to using /bin/bash instead of /bin/sh. Make the switch for write_hugetlb_memory.sh as well which is called from charge_reserved_hugetlb.sh. Link: https://lkml.kernel.org/r/20240116090455.3407378-1-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Shuah Khan <shuah@kernel.org> Cc: David Laight <David.Laight@ACULAB.COM> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25MAINTAINERS: add man-pages git treesPetr Vorel
The maintainer uses both. Link: https://lkml.kernel.org/r/20240117122257.2707637-1-pvorel@suse.cz Signed-off-by: Petr Vorel <pvorel@suse.cz> Reviewed-by: Alejandro Colomar <alx@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25mm: memcontrol: don't throttle dying tasks on memory.highJohannes Weiner
While investigating hosts with high cgroup memory pressures, Tejun found culprit zombie tasks that had were holding on to a lot of memory, had SIGKILL pending, but were stuck in memory.high reclaim. In the past, we used to always force-charge allocations from tasks that were exiting in order to accelerate them dying and freeing up their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg: prohibit unconditional exceeding the limit of dying tasks"); it noted that this can cause (userspace inducable) containment failures, so it added a mandatory reclaim and OOM kill cycle before forcing charges. At the time, memory.high enforcement was handled in the userspace return path, which isn't reached by dying tasks, and so memory.high was still never enforced by dying tasks. When c9afe31ec443 ("memcg: synchronously enforce memory.high for large overcharges") added synchronous reclaim for memory.high, it added unconditional memory.high enforcement for dying tasks as well. The callstack shows that this path is where the zombie is stuck in. We need to accelerate dying tasks getting past memory.high, but we cannot do it quite the same way as we do for memory.max: memory.max is enforced strictly, and tasks aren't allowed to move past it without FIRST reclaiming and OOM killing if necessary. This ensures very small levels of excess. With memory.high, though, enforcement happens lazily after the charge, and OOM killing is never triggered. A lot of concurrent threads could have pushed, or could actively be pushing, the cgroup into excess. The dying task will enter reclaim on every allocation attempt, with little hope of restoring balance. To fix this, skip synchronous memory.high enforcement on dying tasks altogether again. Update memory.high path documentation while at it. [hannes@cmpxchg.org: also handle tasks are being killed during the reclaim] Link: https://lkml.kernel.org/r/20240111192807.GA424308@cmpxchg.org Link: https://lkml.kernel.org/r/20240111132902.389862-1-hannes@cmpxchg.org Fixes: c9afe31ec443 ("memcg: synchronously enforce memory.high for large overcharges") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tejun Heo <tj@kernel.org> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25mm: mmap: map MAP_STACK to VM_NOHUGEPAGEYang Shi
commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") incured regression for stress-ng pthread benchmark [1]. It is because THP get allocated to pthread's stack area much more possible than before. Pthread's stack area is allocated by mmap without VM_GROWSDOWN or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not. The MAP_STACK flag is used to mark the stack area, but it is a no-op on Linux. Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating THP for such stack area. With this change the stack area looks like: fffd18e10000-fffd19610000 rw-p 00000000 00:00 0 Size: 8192 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 12 kB Pss: 12 kB Pss_Dirty: 12 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 12 kB Referenced: 12 kB Anonymous: 12 kB KSM: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB FilePmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 0 VmFlags: rd wr mr mw me ac nh The "nh" flag is set. [1] https://lore.kernel.org/linux-mm/202312192310.56367035-oliver.sang@intel.com/ Link: https://lkml.kernel.org/r/20231221065943.2803551-2-shy828301@gmail.com Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: Oliver Sang <oliver.sang@intel.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christopher Lameter <cl@linux.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: <stable@vger.kerenl.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25uprobes: use pagesize-aligned virtual address when replacing pagesDavid Hildenbrand
uprobes passes an unaligned page mapping address to folio_add_new_anon_rmap(), which ends up triggering a VM_BUG_ON() we recently extended in commit 372cbd4d5a066 ("mm: non-pmd-mappable, large folios for folio_add_new_anon_rmap()"). Arguably, this is uprobes code doing something wrong; however, for the time being it would have likely worked in rmap code because __folio_set_anon() would set folio->index to the same value. Looking at __replace_page(), we'd also pass slightly wrong values to mmu_notifier_range_init(), page_vma_mapped_walk(), flush_cache_page(), ptep_clear_flush() and set_pte_at_notify(). I suspect most of them are fine, but let's just mark the introducing commit as the one needed fixing. I don't think CC stable is warranted. We'll add more sanity checks in rmap code separately, to make sure that we always get properly aligned addresses. Link: https://lkml.kernel.org/r/20240115100731.91007-1-david@redhat.com Fixes: c517ee744b96 ("uprobes: __replace_page() should not use page_address_in_vma()") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Jiri Olsa <jolsa@kernel.org> Closes: https://lkml.kernel.org/r/ZaMR2EWN-HvlCfUl@krava Tested-by: Jiri Olsa <jolsa@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25selftests/mm: mremap_test: fix build warningMuhammad Usama Anjum
Use 2 separate variables of types int and unsigned long long instead of confusing them. This corrects the correct print format for each of them and removes the build warning: warning: format `%d' expects argument of type `int', but argument 2 has type `long long unsigned int' Link: https://lkml.kernel.org/r/20240112071851.612930-1-usama.anjum@collabora.com Fixes: a4cb3b243343 ("selftests: mm: add a test for remapping to area immediately after existing mapping") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25fs/hugetlbfs/inode.c: mm/memory-failure.c: fix hugetlbfs hwpoison handlingSidhartha Kumar
has_extra_refcount() makes the assumption that the page cache adds a ref count of 1 and subtracts this in the extra_pins case. Commit a08c7193e4f1 (mm/filemap: remove hugetlb special casing in filemap.c) modifies __filemap_add_folio() by calling folio_ref_add(folio, nr); for all cases (including hugtetlb) where nr is the number of pages in the folio. We should adjust the number of references coming from the page cache by subtracing the number of pages rather than 1. In hugetlbfs_read_iter(), folio_test_has_hwpoisoned() is testing the wrong flag as, in the hugetlb case, memory-failure code calls folio_test_set_hwpoison() to indicate poison. folio_test_hwpoison() is the correct function to test for that flag. After these fixes, the hugetlb hwpoison read selftest passes all cases. Link: https://lkml.kernel.org/r/20240112180840.367006-1-sidhartha.kumar@oracle.com Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c") Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Closes: https://lore.kernel.org/linux-mm/20230713001833.3778937-1-jiaqiyan@google.com/T/#m8e1469119e5b831bbd05d495f96b842e4a1c5519 Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: James Houghton <jthoughton@google.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> [6.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25readahead: avoid multiple marked readahead pagesJan Kara
ra_alloc_folio() marks a page that should trigger next round of async readahead. However it rounds up computed index to the order of page being allocated. This can however lead to multiple consecutive pages being marked with readahead flag. Consider situation with index == 1, mark == 1, order == 0. We insert order 0 page at index 1 and mark it. Then we bump order to 1, index to 2, mark (still == 1) is rounded up to 2 so page at index 2 is marked as well. Then we bump order to 2, index is incremented to 4, mark gets rounded to 4 so page at index 4 is marked as well. The fact that multiple pages get marked within a single readahead window confuses the readahead logic and results in readahead window being trimmed back to 1. This situation is triggered in particular when maximum readahead window size is not a power of two (in the observed case it was 768 KB) and as a result sequential read throughput suffers. Fix the problem by rounding 'mark' down instead of up. Because the index is naturally aligned to 'order', we are guaranteed 'rounded mark' == index iff 'mark' is within the page we are allocating at 'index' and thus exactly one page is marked with readahead flag as required by the readahead code and sequential read performance is restored. This effectively reverts part of commit b9ff43dd2743 ("mm/readahead: Fix readahead with large folios"). The commit changed the rounding with the rationale: "... we were setting the readahead flag on the folio which contains the last byte read from the block. This is wrong because we will trigger readahead at the end of the read without waiting to see if a subsequent read is going to use the pages we just read." Although this is true, the fact is this was always the case with read sizes not aligned to folio boundaries and large folios in the page cache just make the situation more obvious (and frequent). Also for sequential read workloads it is better to trigger the readahead earlier rather than later. It is true that the difference in the rounding and thus earlier triggering of the readahead can result in reading more for semi-random workloads. However workloads really suffering from this seem to be rare. In particular I have verified that the workload described in commit b9ff43dd2743 ("mm/readahead: Fix readahead with large folios") of reading random 100k blocks from a file like: [reader] bs=100k rw=randread numjobs=1 size=64g runtime=60s is not impacted by the rounding change and achieves ~70MB/s in both cases. [jack@suse.cz: fix one more place where mark rounding was done as well] Link: https://lkml.kernel.org/r/20240123153254.5206-1-jack@suse.cz Link: https://lkml.kernel.org/r/20240104085839.21029-1-jack@suse.cz Fixes: b9ff43dd2743 ("mm/readahead: Fix readahead with large folios") Signed-off-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Guo Xuenan <guoxuenan@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-26drm/sched: Drain all entities in DRM sched run job workerMatthew Brost
All entities must be drained in the DRM scheduler run job worker to avoid the following case. An entity found that is ready, no job found ready on entity, and run job worker goes idle with other entities + jobs ready. Draining all ready entities (i.e. loop over all ready entities) in the run job worker ensures all job that are ready will be scheduled. Cc: Thorsten Leemhuis <regressions@leemhuis.info> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/ Reported-and-tested-by: Mario Limonciello <mario.limonciello@amd.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124 Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/ Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz> Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240124210811.1639040-1-matthew.brost@intel.com
2024-01-26Merge tag 'amd-drm-fixes-6.8-2024-01-25' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes amd-drm-fixes-6.8-2024-01-25: amdgpu: - AC/DC power supply tracking fix - Don't show invalid vram vendor data - SMU 13.0.x fixes - GART fix for umr on systems without VRAM - GFX 10/11 UNORD_DISPATCH fixes - IPS display fixes (required for S0ix on some platforms) - Misc fixes Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240125221503.5019-1-alexander.deucher@amd.com
2024-01-26Merge tag 'drm-xe-fixes-2024-01-25' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes Driver Changes: - Make an ops struct static - Fix an implicit 0 to NULL conversion - A couple of 32-bit fixes - A migration coherency fix for Lunar Lake. - An error path vm id leak fix - Remove PVC references in kunit tests Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/ZbIb7l0EhpVp5cXE@fedora
2024-01-25net: phy: mediatek-ge-soc: sync driver with MediaTek SDKDaniel Golle
Sync initialization and calibration routines with MediaTek's reference driver. Improves compliance and resolves link stability issues with CH340 IoT devices connected to MT798x built-in PHYs. Fixes: 98c485eaf509 ("net: phy: add driver for MediaTek SoC built-in GE PHYs") Signed-off-by: Daniel Golle <daniel@makrotopia.org> Link: https://lore.kernel.org/r/f2195279c234c0f618946424b8236026126bc595.1706071311.git.daniel@makrotopia.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25net: ethernet: mtk_eth_soc: set DMA coherent mask to get PPE workingDaniel Golle
Set DMA coherent mask to 32-bit which makes PPE offloading engine start working on BPi-R4 which got 4 GiB of RAM. Fixes: 2d75891ebc09 ("net: ethernet: mtk_eth_soc: support 36-bit DMA addressing on MT7988") Suggested-by: Elad Yifee <eladwf@users.github.com> Signed-off-by: Daniel Golle <daniel@makrotopia.org> Link: https://lore.kernel.org/r/97e90925368b405f0974b9b15f1b7377c4a329ad.1706113251.git.daniel@makrotopia.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25Merge branch 'nfp-flower-a-few-small-conntrack-offload-fixes'Jakub Kicinski
Louis Peens says: ==================== nfp: flower: a few small conntrack offload fixes This small series addresses two bugs in the nfp conntrack offloading code. The first patch is a check to prevent offloading for a case which is currently not supported by the nfp. The second patch fixes up parsing of layer4 mangling code so it can be correctly offloaded. Since the masks are an inverse mask and we are shifting it so it can be packed together with the destination we effectively need to 'clear' the lower bits of the mask by setting it to 0xFFFF. ==================== Link: https://lore.kernel.org/r/20240124151909.31603-1-louis.peens@corigine.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25nfp: flower: fix hardware offload for the transfer layer portHui Zhou
The nfp driver will merge the tp source port and tp destination port into one dword which the offset must be zero to do hardware offload. However, the mangle action for the tp source port and tp destination port is separated for tc ct action. Modify the mangle action for the FLOW_ACT_MANGLE_HDR_TYPE_TCP and FLOW_ACT_MANGLE_HDR_TYPE_UDP to satisfy the nfp driver offload check for the tp port. The mangle action provides a 4B value for source, and a 4B value for the destination, but only 2B of each contains the useful information. For offload the 2B of each is combined into a single 4B word. Since the incoming mask for the source is '0xFFFF<mask>' the shift-left will throw away the 0xFFFF part. When this gets combined together in the offload it will clear the destination field. Fix this by setting the lower bits back to 0xFFFF, effectively doing a rotate-left operation on the mask. Fixes: 5cee92c6f57a ("nfp: flower: support hw offload for ct nat action") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Hui Zhou <hui.zhou@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Link: https://lore.kernel.org/r/20240124151909.31603-3-louis.peens@corigine.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25nfp: flower: add hardware offload check for post ct entryHui Zhou
The nfp offload flow pay will not allocate a mask id when the out port is openvswitch internal port. This is because these flows are used to configure the pre_tun table and are never actually send to the firmware as an add-flow message. When a tc rule which action contains ct and the post ct entry's out port is openvswitch internal port, the merge offload flow pay with the wrong mask id of 0 will be send to the firmware. Actually, the nfp can not support hardware offload for this situation, so return EOPNOTSUPP. Fixes: bd0fe7f96a3c ("nfp: flower-ct: add zone table entry when handling pre/post_ct flows") CC: stable@vger.kernel.org # 5.14+ Signed-off-by: Hui Zhou <hui.zhou@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Link: https://lore.kernel.org/r/20240124151909.31603-2-louis.peens@corigine.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25gve: Fix skb truesize underestimationPraveen Kaligineedi
For a skb frag with a newly allocated copy page, the true size is incorrectly set to packet buffer size. It should be set to PAGE_SIZE instead. Fixes: 82fd151d38d9 ("gve: Reduce alloc and copy costs in the GQ rx path") Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com> Link: https://lore.kernel.org/r/20240124161025.1819836-1-pkaligineedi@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25selftests/net/lib: update busywait timeout valueHangbin Liu
The busywait timeout value is a millisecond, not a second. So the current setting 2 is too small. On slow/busy host (or VMs) the current timeout can expire even on "correct" execution, causing random failures. Let's copy the WAIT_TIMEOUT from forwarding/lib.sh and set BUSYWAIT_TIMEOUT here. Fixes: 25ae948b4478 ("selftests/net: add lib.sh") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240124061344.1864484-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25bcachefs: __lookup_dirent() works in snapshot, not subvolKent Overstreet
Add a new helper, bch2_hash_lookup_in_snapshot(), for when we're not operating in a subvolume and already have a snapshot ID, and then use it in lookup_lostfound() -> __lookup_dirent(). This is a bugfix - lookup_lostfound() doesn't take a subvolume ID, we were passing a nonsense subvolume ID before, and don't have one to pass since we may be operating in an interior snapshot node that doesn't have a subvolume ID. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-25Merge tag 'md-6.8-20240126' of ↵Jens Axboe
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.8 Pull MD fix from Song: "This change fixes a RCU warning." * tag 'md-6.8-20240126' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: fix a suspicious RCU usage warning
2024-01-25selftests: tcp_ao: set the timeout to 2 minutesJakub Kicinski
The default timeout for tests is 45sec, bench-lookups_ipv6 seems to take around 50sec when running in a VM without HW acceleration. Give it a 2x margin and set the timeout to 120sec. Fixes: d1066c9c58d4 ("selftests/net: Add test/benchmark for removing MKTs") Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://lore.kernel.org/r/20240124233630.1977708-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25Merge branch 'selftests-net-a-few-fixes'Jakub Kicinski
Paolo Abeni says: ==================== selftests: net: a few fixes This series address self-tests failures for udp gro-related tests. The first patch addresses the main problem I observe locally - the XDP program required by such tests, xdp_dummy, is currently build in the ebpf self-tests directory, not available if/when the user targets net only. Arguably is more a refactor than a fix, but still targeting net to hopefully The second patch fixes the integration of such tests with the build system. Patch 3/3 fixes sporadic failures due to races. Tested with: make -C tools/testing/selftests/ TARGETS=net install ./tools/testing/selftests/kselftest_install/run_kselftest.sh \ -t "net:udpgro_bench.sh net:udpgro.sh net:udpgro_fwd.sh \ net:udpgro_frglist.sh net:veth.sh" no failures. ==================== Link: https://lore.kernel.org/r/cover.1706131762.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25selftests: net: explicitly wait for listener readyPaolo Abeni
The UDP GRO forwarding test still hard-code an arbitrary pause to wait for the UDP listener becoming ready in background. That causes sporadic failures depending on the host load. Replace the sleep with the existing helper waiting for the desired port being exposed. Fixes: a062260a9d5f ("selftests: net: add UDP GRO forwarding self-tests") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/4d58900fb09cef42749cfcf2ad7f4b91a97d225c.1706131762.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25selftests: net: included needed helper in the install targetsPaolo Abeni
The blamed commit below introduce a dependency in some net self-tests towards a newly introduce helper script. Such script is currently not included into the TEST_PROGS_EXTENDED list and thus is not installed, causing failure for the relevant tests when executed from the install dir. Fix the issue updating the install targets. Fixes: 3bdd9fd29cb0 ("selftests/net: synchronize udpgro tests' tx and rx connection") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/076e8758e21ff2061cc9f81640e7858df775f0a9.1706131762.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25selftests: net: remove dependency on ebpf testsPaolo Abeni
Several net tests requires an XDP program build under the ebpf directory, and error out if such program is not available. That makes running successful net test hard, let's duplicate into the net dir the [very small] program, re-using the existing rules to build it, and finally dropping the bogus dependency. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/28e7af7c031557f691dc8045ee41dd549dd5e74c.1706131762.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25selftests: tcp_ao: add a config fileJakub Kicinski
Still a bit unclear whether each directory should have its own config file, but assuming they should lets add one for tcp_ao. The following tests still fail with this config in place: - rst_ipv4, - rst_ipv6, - bench-lookups_ipv6. other 21 pass. Fixes: d11301f65977 ("selftests/net: Add TCP-AO ICMPs accept test") Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://lore.kernel.org/r/20240124192550.1865743-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25spi: fix finalize message on error returnDavid Lechner
In __spi_pump_transfer_message(), the message was not finalized in the first error return as it is in the other error return paths. Not finalizing the message could cause anything waiting on the message to complete to hang forever. This adds the missing call to spi_finalize_current_message(). Fixes: ae7d2346dc89 ("spi: Don't use the message queue if possible in spi_sync") Signed-off-by: David Lechner <dlechner@baylibre.com> Link: https://msgid.link/r/20240125205312.3458541-2-dlechner@baylibre.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-01-25drm/amd/display: "Enable IPS by default"Roman Li
[Why] IPS was temporary disabled due to instability. It was fixed in dmub firmware and with: - "drm/amd/display: Add IPS checks before dcn register access" - "drm/amd/display: Disable ips before dc interrupt setting" [How] Enable IPS by default. Disable IPS if 0x800 bit set in amdgpu.dcdebugmask module params Signed-off-by: Roman Li <Roman.Li@amd.com> Tested-by: Mark Broadworth <mark.broadworth@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25drm/amd: Add a DC debug mask for IPSRoman Li
For debugging IPS-related issues, expose a new debug mask that allows to disable IPS. Usage: amdgpu.dcdebugmask=0x800 Signed-off-by: Roman Li <Roman.Li@amd.com> Tested-by: Mark Broadworth <mark.broadworth@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25drm/amd/display: Disable ips before dc interrupt settingRoman Li
[Why] While in IPS2 an access to dcn registers is not allowed. If interrupt results in dc call, we should disable IPS. [How] Safeguard register access in IPS2 by disabling idle optimization before calling dc interrupt setting api. Signed-off-by: Roman Li <Roman.Li@amd.com> Tested-by: Mark Broadworth <mark.broadworth@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25drm/amd/display: Replay + IPS + ABM in Full Screen VPBChunTao Tso
[Why] Because ABM will wait VStart to start getting histogram data, it will cause we can't enter IPS while full screnn video playing. [How] Modify the panel refresh rate to the maximun multiple of current refresh rate. Reviewed-by: Dennis Chan <dennis.chan@amd.com> Acked-by: Roman Li <roman.li@amd.com> Signed-off-by: ChunTao Tso <chuntao.tso@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25drm/amd/display: Add IPS checks before dcn register accessRoman Li
[Why] With IPS enabled a system hangs once PSR is active. PSR active triggers transition to IPS2 state. While in IPS2 an access to dcn registers results in hard hang. Existing check doesn't cover for PSR sequence. [How] Safeguard register access by disabling idle optimization in atomic commit and crtc scanout. It will be re-enabled on next vblank. Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com> Acked-by: Roman Li <roman.li@amd.com> Signed-off-by: Roman Li <roman.li@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25drm/amd/display: Add Replay IPS register for DMUB command tableAlvin Lee
- Introduce a new Replay mode for DMUB version 0.0.199.0 Reviewed-by: Martin Leung <martin.leung@amd.com> Acked-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Alvin Lee <alvin.lee2@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25drm/amd/display: Allow IPS2 during ReplayNicholas Kazlauskas
[Why & How] Add regkey to block video playback in IPS2 by default Allow idle optimizations in the same spot we allow Replay for video playback usecases. Avoid sending it when there's an external display connected by modifying the allow idle checks to check for active non-eDP screens. Reviewed-by: Charlene Liu <charlene.liu@amd.com> Acked-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25drm/amdgpu/gfx11: set UNORD_DISPATCH in compute MQDsAlex Deucher
This needs to be set to 1 to avoid a potential deadlock in the GC 10.x and newer. On GC 9.x and older, this needs to be set to 0. This can lead to hangs in some mixed graphics and compute workloads. Updated firmware is also required for AQL. Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2024-01-25drm/amdgpu/gfx10: set UNORD_DISPATCH in compute MQDsAlex Deucher
This needs to be set to 1 to avoid a potential deadlock in the GC 10.x and newer. On GC 9.x and older, this needs to be set to 0. This can lead to hangs in some mixed graphics and compute workloads. Updated firmware is also required for AQL. Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2024-01-25drm/amd/amdgpu: Assign GART pages to AMD device mappingTom St Denis
This allows kernel mapped pages like the PDB and PTB to be read via the iomem debugfs when there is no vram in the system. Signed-off-by: Tom St Denis <tom.stdenis@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.7.x
2024-01-25drm/amd/pm: Fetch current power limit from FWLijo Lazar
Power limit of SMUv13.0.6 SOCs can be updated by out-of-band ways. Fetch the limit from firmware instead of using cached values. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.7.x
2024-01-25drm/amdgpu: Fix null pointer dereferenceHawking Zhang
amdgpu_reg_state_sysfs_fini could be invoked at the time when asic_func is even not initialized, i.e., amdgpu_discovery_init fails for some reason. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2024-01-25drm/amdgpu: Show vram vendor only if availableLijo Lazar
Ony if vram vendor info is available, show in sysfs. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.7.x
2024-01-25drm/amd/pm: update the power cap settingKenneth Feng
update the power cap setting for smu_v13.0.0/smu_v13.0.7 Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2356 Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2024-01-25drm/amdgpu: Avoid fetching vram vendor informationLijo Lazar
For GFX 9.4.3 APUs, the current method of fetching vram vendor information is not reliable. Avoid fetching the information. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.7.x
2024-01-25drm/amdgpu/pm: Fix the power source flag errorMa Jun
The power source flag should be updated when [1] System receives an interrupt indicating that the power source has changed. [2] System resumes from suspend or runtime suspend Signed-off-by: Ma Jun <Jun.Ma2@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org