2023-08-21  mm/page_ext: add common function to get client data from page_ext  (Kemeng Shi)
Patch series "add page_ext_data to get client data in page_ext". Current clients get data from page_ext by adding offset which is auto generated in page_ext core and exposes the data layout design inside page_ext core. This series adds a page_ext_data() to hide this from clients. Benefits include: 1. Future clients can call page_ext_data directly instead of defining a new function like get_page_owner to get the data. 2. There is no change to clients if the layout of page_ext data changes. This patch (of 3): Add common page_ext_data function to get client data. This could hide offset which is auto generated in page_ext core and expose the desgin of page_ext data layout. Link: https://lkml.kernel.org/r/20230718145812.1991717-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20230718145812.1991717-2-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Andrew Morton <akpm@linux-foudation.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21  zswap: make zswap_load() take a folio  (Matthew Wilcox (Oracle))
Only convert a few easy parts of this function to use the folio passed in; convert back to struct page for the majority of it. Removes three hidden calls to compound_head(). Link: https://lkml.kernel.org/r/20230715042343.434588-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21  swap: remove some calls to compound_head() in swap_readpage()  (Matthew Wilcox (Oracle))
Replace six implicit calls to compound_head() with one call to page_folio(). Link: https://lkml.kernel.org/r/20230715042343.434588-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
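[Editor's note] The pattern, roughly sketched (not the exact diff): resolve the folio once up front, then use folio helpers, which already operate on the head page:

    struct folio *folio = page_folio(page);  /* one explicit head lookup */

    /* folio helpers don't re-run compound_head(), unlike the
     * PageUptodate()/SetPageUptodate()-style calls they replace */
    folio_mark_uptodate(folio);
    folio_unlock(folio);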
2023-08-21  memcg: convert get_obj_cgroup_from_page to get_obj_cgroup_from_folio  (Matthew Wilcox (Oracle))
As the one caller now has a folio, pass it in and use it. Removes three calls to compound_head(). Link: https://lkml.kernel.org/r/20230715042343.434588-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21  zswap: make zswap_store() take a folio  (Matthew Wilcox (Oracle))
Patch series "Followup folio conversions for zswap". With frontswap killed, it's worth converting the zswap_load() and zswap_store() functions to take a folio instead of a page pointer. They aren't converted to support large folios, but there are a lot of unnecessary calls to compound_head() that are removed by these patches. This patch (of 4): Only convert a few easy parts of this function to use the folio passed in; convert back to struct page for the majority of it. This does remove a few hidden calls to compound_head(). Link: https://lkml.kernel.org/r/20230715042343.434588-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230715042343.434588-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21  mm: kill frontswap  (Johannes Weiner)
The only user of frontswap is zswap, and has been for a long time. Have swap call into zswap directly and remove the indirection. [hannes@cmpxchg.org: remove obsolete comment, per Yosry] Link: https://lkml.kernel.org/r/20230719142832.GA932528@cmpxchg.org [fengwei.yin@intel.com: don't warn if none swapcache folio is passed to zswap_load] Link: https://lkml.kernel.org/r/20230810095652.3905184-1-fengwei.yin@intel.com Link: https://lkml.kernel.org/r/20230717160227.GA867137@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21  mm: zswap: multiple zpools support  (Yosry Ahmed)
Support using multiple zpools of the same type in zswap, for concurrency purposes. A fixed number of 32 zpools is suggested by this commit, which was determined empirically. It can be later changed or made into a config option if needed.

On a setup with zswap and zsmalloc, comparing a single zpool to 32 zpools shows improvements in the zsmalloc lock contention, especially on the swap out path. The following shows the perf analysis of the swapout path when 10 workloads are simultaneously reclaiming and refaulting tmpfs pages. There are some improvements on the swap in path as well, but less significant.

1 zpool:

  |--28.99%--zswap_frontswap_store
  |     <snip>
  |     |--8.98%--zpool_map_handle
  |     |     |
  |     |      --8.98%--zs_zpool_map
  |     |           |
  |     |            --8.95%--zs_map_object
  |     |                 |
  |     |                  --8.38%--_raw_spin_lock
  |     |                       |
  |     |                        --7.39%--queued_spin_lock_slowpath
  |     |--8.82%--zpool_malloc
  |     |     |
  |     |      --8.82%--zs_zpool_malloc
  |     |           |
  |     |            --8.80%--zs_malloc
  |     |                 |
  |     |                 |--7.21%--_raw_spin_lock
  |     |                 |     |
  |     |                 |      --6.81%--queued_spin_lock_slowpath
  <snip>

32 zpools:

  |--16.73%--zswap_frontswap_store
  |     <snip>
  |     |--1.81%--zpool_malloc
  |     |     |
  |     |      --1.81%--zs_zpool_malloc
  |     |           |
  |     |            --1.79%--zs_malloc
  |     |                 |
  |     |                  --0.73%--obj_malloc
  |     |--1.06%--zswap_update_total_size
  |     |--0.59%--zpool_map_handle
  |     |     |
  |     |      --0.59%--zs_zpool_map
  |     |           |
  |     |            --0.57%--zs_map_object
  |     |                 |
  |     |                  --0.51%--_raw_spin_lock
  <snip>

Link: https://lkml.kernel.org/r/20230620194644.3142384-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Suggested-by: Yu Zhao <yuzhao@google.com> Acked-by: Chris Li (Google) <chrisl@kernel.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Tested-by: Nhat Pham <nphamcs@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
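[Editor's note] The concurrency scheme can be sketched as a fixed array indexed by a hash of the entry pointer, so concurrent stores spread across zsmalloc instances (a sketch following the changelog; exact helper and field names may differ):

    #define ZSWAP_NR_ZPOOLS 32      /* empirically chosen, per the changelog */

    static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
    {
            /* hash the entry pointer to pick one of the 32 zpools */
            return entry->pool->zpools[hash_ptr(entry, ilog2(ZSWAP_NR_ZPOOLS))];
    }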
2023-08-21  mm: multi-gen LRU: don't spin during memcg release  (T.J. Mercier)
When a memcg is in the process of being released mem_cgroup_tryget will fail because its reference count has already reached 0. This can happen during reclaim if the memcg has already been offlined, and we reclaim all remaining pages attributed to the offlined memcg. shrink_many attempts to skip the empty memcg in this case, and continue reclaiming from the remaining memcgs in the old generation. If there is only one memcg remaining, or if all remaining memcgs are in the process of being released then shrink_many will spin until all memcgs have finished being released. The release occurs through a workqueue, so it can take a while before kswapd is able to make any further progress.

This fix results in reductions in kswapd activity and direct reclaim in a test where 28 apps (working set size > total memory) are repeatedly launched in a random sequence:

                                      A      B   delta  ratio(%)
  allocstall_movable              5962   3539   -2423    -40.64
  allocstall_normal               2661   2417    -244     -9.17
  kswapd_high_wmark_hit_quickly  53152   7594  -45558    -85.71
  pageoutrun                     57365  11750  -45615    -79.52

Link: https://lkml.kernel.org/r/20230814151636.1639123-1-tjmercier@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: T.J. Mercier <tjmercier@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21  mm: memory-failure: fix unexpected return value in soft_offline_page()  (Miaohe Lin)
When page_handle_poison() fails to handle the hugepage or free page in the retry path, soft_offline_page() will return 0 while -EBUSY is expected in this case. Consequently the user will think soft_offline_page() succeeded when it in fact failed, and so will not try again later. Link: https://lkml.kernel.org/r/20230627112808.1275241-1-linmiaohe@huawei.com Fixes: b94e02822deb ("mm,hwpoison: try to narrow window race for free pages") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
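[Editor's note] The shape of the fix, sketched from the changelog (argument names are illustrative):

    if (!page_handle_poison(page, hugepage_or_freepage, true)) {
            if (try_again) {
                    try_again = false;
                    goto retry;
            }
            ret = -EBUSY;   /* previously the earlier 0 was returned */
    }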
2023-08-21  radix tree: remove unused variable  (Arnd Bergmann)
Recent versions of clang warn about an unused variable, though older versions saw the 'slot++' as a use and did not warn: radix-tree.c:1136:50: error: parameter 'slot' set but not used [-Werror,-Wunused-but-set-parameter] It's clearly not needed any more, so just remove it. Link: https://lkml.kernel.org/r/20230811131023.2226509-1-arnd@kernel.org Fixes: 3a08cd52c37c7 ("radix tree: Remove multiorder support") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Rong Tao <rongtao@cestc.cn> Cc: Tom Rix <trix@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21  mm: add a call to flush_cache_vmap() in vmap_pfn()  (Alexandre Ghiti)
flush_cache_vmap() must be called after new vmalloc mappings are installed in the page table in order to allow architectures to make sure the new mapping is visible. It could lead to a panic since on some architectures (like powerpc), the page table walker could see the wrong pte value and trigger a spurious page fault that can not be resolved (see commit f1cb8f9beba8 ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")). But actually the patch is aiming at riscv: the riscv specification allows the caching of invalid entries in the TLB, and since we recently removed the vmalloc page fault handling, we now need to emit a tlb shootdown whenever a new vmalloc mapping is emitted (https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/). That's a temporary solution, there are ways to avoid that :) Link: https://lkml.kernel.org/r/20230809164633.1556126-1-alexghiti@rivosinc.com Fixes: 3e9a9e256b1e ("mm: add a vmap_pfn function") Reported-by: Dylan Jhong <dylan@andestech.com> Closes: https://lore.kernel.org/linux-riscv/ZMytNY2J8iyjbPPy@atctrx.andestech.com/ Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Reviewed-by: Dylan Jhong <dylan@andestech.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
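[Editor's note] The fix amounts to flushing the freshly installed range before handing the address back; a sketch of the tail of vmap_pfn() in mm/vmalloc.c:

    /* ... pfns installed via apply_to_page_range() ... */

    /* Make the new mapping visible on architectures that may cache
     * invalid entries in the TLB (e.g. riscv) or need a barrier after
     * set_pte (e.g. powerpc). */
    flush_cache_vmap((unsigned long)area->addr,
                     (unsigned long)area->addr + count * PAGE_SIZE);

    return area->addr;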
2023-08-21  selftests/mm: FOLL_LONGTERM need to be updated to 0x100  (Ayush Jain)
After commit 2c2241081f7d ("mm/gup: move private gup FOLL_ flags to internal.h") the FOLL_LONGTERM flag value got updated from 0x10000 to 0x100 at include/linux/mm_types.h. Since hmm.hmm_device_private.hmm_gup_test uses FOLL_LONGTERM, update the value here as well.

Before this change the test goes into an infinite assert loop in hmm.hmm_device_private.hmm_gup_test:

  ==========================================================
  RUN hmm.hmm_device_private.hmm_gup_test ...
  hmm-tests.c:1962:hmm_gup_test:Expected HMM_DMIRROR_PROT_WRITE..
  ..(2) == m[2] (34)
  hmm-tests.c:157:hmm_gup_test:Expected ret (-1) == 0 (0)
  hmm-tests.c:157:hmm_gup_test:Expected ret (-1) == 0 (0)
  ...
  ==========================================================

  Call Trace:
  <TASK>
  ? sched_clock+0xd/0x20
  ? __lock_acquire.constprop.0+0x120/0x6c0
  ? ktime_get+0x2c/0xd0
  ? sched_clock+0xd/0x20
  ? local_clock+0x12/0xd0
  ? lock_release+0x26e/0x3b0
  pin_user_pages_fast+0x4c/0x70
  gup_test_ioctl+0x4ff/0xbb0
  ? gup_test_ioctl+0x68c/0xbb0
  __x64_sys_ioctl+0x99/0xd0
  do_syscall_64+0x60/0x90
  ? syscall_exit_to_user_mode+0x2a/0x50
  ? do_syscall_64+0x6d/0x90
  ? syscall_exit_to_user_mode+0x2a/0x50
  ? do_syscall_64+0x6d/0x90
  ? irqentry_exit_to_user_mode+0xd/0x20
  ? irqentry_exit+0x3f/0x50
  ? exc_page_fault+0x96/0x200
  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  RIP: 0033:0x7f6aaa31aaff

After this change the test passes successfully.

Link: https://lkml.kernel.org/r/20230808124347.79163-1-ayush.jain3@amd.com Fixes: 2c2241081f7d ("mm/gup: move private gup FOLL_ flags to internal.h") Signed-off-by: Ayush Jain <ayush.jain3@amd.com> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
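[Editor's note] The fix itself is a one-liner in the selftest's local flag definition (a sketch; the selftest carries its own copy of the value, which must track include/linux/mm_types.h):

    /* was 0x10000 before commit 2c2241081f7d */
    #define FOLL_LONGTERM 0x100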
2023-08-21  nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers()  (Ryusuke Konishi)
A syzbot stress test reported that create_empty_buffers() called from nilfs_lookup_dirty_data_buffers() can cause a general protection fault. Analysis using its reproducer revealed that the back reference "mapping" from a page/folio has been changed to NULL after dirty page/folio gang lookup in nilfs_lookup_dirty_data_buffers(). Fix this issue by excluding pages/folios from being collected if, after acquiring a lock on each page/folio, its back reference "mapping" differs from the pointer to the address space struct that held the page/folio. Link: https://lkml.kernel.org/r/20230805132038.6435-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0ad741797f4565e7e2d2@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/0000000000002930a705fc32b231@google.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
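[Editor's note] The collection loop then looks roughly like this after taking the lock (a sketch of the pattern, not the exact diff):

    folio_lock(folio);
    if (unlikely(folio->mapping != mapping)) {
            /* truncated or reclaimed after the gang lookup;
             * don't touch its buffers */
            folio_unlock(folio);
            continue;
    }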
2023-08-21  mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast  (David Hildenbrand)
In contrast to most other GUP code, GUP-fast common page table walking code like gup_pte_range() also handles hugetlb pages. But in contrast to other hugetlb page table walking code, it does not look at the hugetlb PTE abstraction whereby we have only a single logical hugetlb PTE per hugetlb page, even when using multiple cont-PTEs underneath -- which is for example what huge_ptep_get() abstracts.

So when we have a hugetlb page that is mapped via cont-PTEs, GUP-fast might stumble over a PTE that does not map the head page of a hugetlb page -- not the first "head" PTE of such a cont mapping.

Logically, the whole hugetlb page is mapped (entire_mapcount == 1), but we might end up calling gup_must_unshare() with a tail page of a hugetlb page.

We only maintain a single PageAnonExclusive flag per hugetlb page (as hugetlb pages cannot get partially COW-shared), stored for the head page. That flag is clear for all tail pages.

So when gup_must_unshare() ends up calling PageAnonExclusive() with a tail page of a hugetlb page:

1) With CONFIG_DEBUG_VM_PGFLAGS

Stumbles over the:

  VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);

For example, when executing the COW selftests with 64k hugetlb pages on arm64:

  [ 61.082187] page:00000000829819ff refcount:3 mapcount:1 mapping:0000000000000000 index:0x1 pfn:0x11ee11
  [ 61.082842] head:0000000080f79bf7 order:4 entire_mapcount:1 nr_pages_mapped:0 pincount:2
  [ 61.083384] anon flags: 0x17ffff80003000e(referenced|uptodate|dirty|head|mappedtodisk|node=0|zone=2|lastcpupid=0xfffff)
  [ 61.084101] page_type: 0xffffffff()
  [ 61.084332] raw: 017ffff800000000 fffffc00037b8401 0000000000000402 0000000200000000
  [ 61.084840] raw: 0000000000000010 0000000000000000 00000000ffffffff 0000000000000000
  [ 61.085359] head: 017ffff80003000e ffffd9e95b09b788 ffffd9e95b09b788 ffff0007ff63cf71
  [ 61.085885] head: 0000000000000000 0000000000000002 00000003ffffffff 0000000000000000
  [ 61.086415] page dumped because: VM_BUG_ON_PAGE(PageHuge(page) && !PageHead(page))
  [ 61.086914] ------------[ cut here ]------------
  [ 61.087220] kernel BUG at include/linux/page-flags.h:990!
  [ 61.087591] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
  [ 61.087999] Modules linked in: ...
  [ 61.089404] CPU: 0 PID: 4612 Comm: cow Kdump: loaded Not tainted 6.5.0-rc4+ #3
  [ 61.089917] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  [ 61.090409] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  [ 61.090897] pc : gup_must_unshare.part.0+0x64/0x98
  [ 61.091242] lr : gup_must_unshare.part.0+0x64/0x98
  [ 61.091592] sp : ffff8000825eb940
  [ 61.091826] x29: ffff8000825eb940 x28: 0000000000000000 x27: fffffc00037b8440
  [ 61.092329] x26: 0400000000000001 x25: 0000000000080101 x24: 0000000000080000
  [ 61.092835] x23: 0000000000080100 x22: ffff0000cffb9588 x21: ffff0000c8ec6b58
  [ 61.093341] x20: 0000ffffad6b1000 x19: fffffc00037b8440 x18: ffffffffffffffff
  [ 61.093850] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548
  [ 61.094358] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021
  [ 61.094858] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffffd9e958b7a1c0
  [ 61.095359] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 00000000002bffa8
  [ 61.095873] x5 : ffff0008bb19e708 x4 : 0000000000000000 x3 : 0000000000000000
  [ 61.096380] x2 : 0000000000000000 x1 : ffff0000cf6636c0 x0 : 0000000000000046
  [ 61.096894] Call trace:
  [ 61.097080]  gup_must_unshare.part.0+0x64/0x98
  [ 61.097392]  gup_pte_range+0x3a8/0x3f0
  [ 61.097662]  gup_pgd_range+0x1ec/0x280
  [ 61.097942]  lockless_pages_from_mm+0x64/0x1a0
  [ 61.098258]  internal_get_user_pages_fast+0xe4/0x1d0
  [ 61.098612]  pin_user_pages_fast+0x58/0x78
  [ 61.098917]  pin_longterm_test_start+0xf4/0x2b8
  [ 61.099243]  gup_test_ioctl+0x170/0x3b0
  [ 61.099528]  __arm64_sys_ioctl+0xa8/0xf0
  [ 61.099822]  invoke_syscall.constprop.0+0x7c/0xd0
  [ 61.100160]  el0_svc_common.constprop.0+0xe8/0x100
  [ 61.100500]  do_el0_svc+0x38/0xa0
  [ 61.100736]  el0_svc+0x3c/0x198
  [ 61.100971]  el0t_64_sync_handler+0x134/0x150
  [ 61.101280]  el0t_64_sync+0x17c/0x180
  [ 61.101543] Code: aa1303e0 f00074c1 912b0021 97fffeb2 (d4210000)

2) Without CONFIG_DEBUG_VM_PGFLAGS

Always detects "not exclusive" for passed tail pages and refuses to PIN the tail pages R/O, as gup_must_unshare() == true. GUP-fast will fallback to ordinary GUP. As ordinary GUP properly considers the logical hugetlb PTE abstraction in hugetlb_follow_page_mask(), pinning the page will succeed when looking at the PageAnonExclusive on the head page only.

So the only real effect of this is that with cont-PTE hugetlb pages, we'll always fallback from GUP-fast to ordinary GUP when not working on the head page, which ends up checking the head page and do the right thing. Consequently, the cow selftests pass with cont-PTE hugetlb pages as well without CONFIG_DEBUG_VM_PGFLAGS.

Note that this only applies to anon hugetlb pages that are mapped using cont-PTEs: for example 64k hugetlb pages on a 4k arm64 kernel. ... and only when R/O-pinning (FOLL_PIN) such pages that are mapped into the page table R/O using GUP-fast.

On production kernels (and even most debug kernels, that don't set CONFIG_DEBUG_VM_PGFLAGS) this patch should theoretically not be required to be backported. But of course, it does not hurt.
Link: https://lkml.kernel.org/r/20230805101256.87306-1-david@redhat.com Fixes: a7f226604170 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
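[Editor's note] The underlying idea can be sketched as: translate to the head page before testing the per-hugetlb-page flag (placement and surrounding checks in the real patch differ):

    /* PageAnonExclusive is only maintained on the hugetlb head page;
     * GUP-fast may hand us a cont-PTE tail page here */
    if (PageHuge(page))
            page = compound_head(page);
    return !PageAnonExclusive(page);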
2023-08-21  selftests: cgroup: fix test_kmem_basic less than error  (Lucas Karpinski)
test_kmem_basic creates 100,000 negative dentries, with each one mapping to a slab object. After memory.high is set, these are reclaimed through the shrink_slab function call which reclaims all 100,000 entries. The test passes the majority of the time because the calculated slab1 and current values are usually above 0; however, 0 is also an acceptable value. Link: https://lkml.kernel.org/r/7d6gcuyzdjcice6qbphrmpmv5skr5jtglg375unnjxqhstvhxc@qkn6dw6bao6v Signed-off-by: Lucas Karpinski <lkarpins@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
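[Editor's note] A sketch of the fix implied by the changelog (treating 0 as valid and only rejecting genuinely invalid, negative readings; exact variable names follow the test):

    /* 0 is acceptable once all 100,000 dentries have been reclaimed */
    if (slab1 < 0)          /* was: slab1 <= 0 */
            goto cleanup;
    if (current < 0)        /* was: current <= 0 */
            goto cleanup;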
2023-08-21  mm: enable page walking API to lock vmas during the walk  (Suren Baghdasaryan)
walk_page_range() and friends often operate under write-locked mmap_lock. With introduction of vma locks, the vmas have to be locked as well during such walks to prevent concurrent page faults in these areas. Add an additional member to mm_walk_ops to indicate locking requirements for the walk. The change ensures that page walks which prevent concurrent page faults by write-locking mmap_lock, operate correctly after introduction of per-vma locks. With per-vma locks page faults can be handled under vma lock without taking mmap_lock at all, so write locking mmap_lock would not stop them. The change ensures vmas are properly locked during such walks. A sample issue this solves is do_mbind() performing queue_pages_range() to queue pages for migration. Without this change a concurrent page can be faulted into the area and be left out of migration. Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Suggested-by: Jann Horn <jannh@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
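[Editor's note] The new member and its options, roughly as introduced (comments paraphrased):

    enum page_walk_lock {
            /* walk under the shared mmap_lock, no vma locking */
            PGWALK_RDLOCK,
            /* walker write-locks each vma as it is walked */
            PGWALK_WRLOCK,
            /* re-assert that vmas are already write-locked */
            PGWALK_WRLOCK_VERIFY,
    };

    struct mm_walk_ops {
            /* ... pte/pmd/hugetlb entry callbacks ... */
            enum page_walk_lock walk_lock;
    };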
2023-08-21  smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()  (David Hildenbrand)
We shouldn't be using a GUP-internal helper if it can be avoided. Similar to smaps_pte_entry() that uses vm_normal_page(), let's use vm_normal_page_pmd() that similarly refuses to return the huge zeropage.

In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd():

(1) Will always return the head page, not a tail page of a THP.

If we'd ever call smaps_account with a tail page while setting "compound = true", we could be in trouble, because smaps_account() would look at the memmap of unrelated pages. If we're unlucky, that memmap does not exist at all. Before we removed PG_doublemap, we could have triggered something similar as in commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for migration entry"). This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock"):

 (a) We're in show_smaps_rollup() and processed a VMA
 (b) We release the mmap lock in show_smaps_rollup() because it is contended
 (c) We merged that VMA with another VMA
 (d) We collapsed a THP in that merged VMA at that position

If the end address of the original VMA falls into the middle of a THP area, we would call smap_gather_stats() with a start address that falls into a PMD-mapped THP. It's probably very rare to trigger when not really forced.

(2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page()

Treat such PMDs here just like smaps_pte_entry() would treat such PTEs. If such pages would be anonymous, we most certainly would want to account them.

(3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap()

As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE pages. So just like smaps_pte_entry(), we'll now also ignore such PMD entries. Especially, follow_pmd_mask() never ends up calling follow_trans_huge_pmd() on pmd_devmap(). Instead it calls follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN is set. So skipping pmd_devmap() pages seems to be the right thing to do.

(4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page()

We won't be returning a memmap that should be ignored by core-mm, or worse, a memmap that does not even exist. Note that while walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't. Most probably this case doesn't currently really happen on the PMD level, otherwise we'd already be able to trigger kernel crashes when reading smaps / smaps_rollup.

So most probably only (1) is relevant in practice as of now, but could only cause trouble in extreme corner cases.

Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future reuse in wrong context.

Link: https://lkml.kernel.org/r/20230803143208.383663-3-david@redhat.com Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: liubo <liubo254@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
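[Editor's note] In smaps_pmd_entry() the substitution is essentially the following (a sketch):

    /* before, via a GUP-internal helper:
     *      page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
     * after: */
    page = vm_normal_page_pmd(vma, addr, *pmd);
    if (!page)
            return;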
2023-08-21  mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT  (David Hildenbrand)
Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") missed that follow_page() and follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting or due to inaccessible (PROT_NONE) VMAs. As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from gup/gup_fast"): "Other follow_page callers like KSM should not use FOLL_NUMA, or they would fail to get the pages if they use follow_page instead of get_user_pages."

liubo reported [1] that smaps_rollup results are imprecise, because they miss accounting of pages that are mapped PROT_NONE. Further, it's easy to reproduce that KSM no longer works on inaccessible VMAs on x86-64, because pte_protnone()/pmd_protnone() also indicates "true" in inaccessible VMAs, and follow_page() refuses to return such pages right now. As KVM really depends on these NUMA hinting faults, removing the pte_protnone()/pmd_protnone() handling in GUP code completely is not really an option.

To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT to restore the original behavior for now and add better comments. Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in is_valid_gup_args(), to add that flag for all external GUP users. Note that there are three GUP-internal __get_user_pages() users that don't end up calling is_valid_gup_args() and consequently won't get FOLL_HONOR_NUMA_FAULT set:

1) get_dump_page(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE and wouldn't have honored NUMA hinting faults already.
2) populate_vma_page_range(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have honored NUMA hinting faults already.
3) faultin_vma_page_range(): we similarly don't want to handle NUMA hinting faults.

To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in inaccessible VMAs properly, we have to perform VMA accessibility checks in gup_can_follow_protnone(). As GUP-fast should reject such pages either way in pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and arm64 that both implement pte_protnone() -- let's just always fallback to ordinary GUP when stumbling over pte_protnone()/pmd_protnone().

As Linus notes [2], honoring NUMA faults might only make sense for selected GUP users. So we should really see if we can instead let relevant GUP callers specify it manually, and not trigger NUMA hinting faults from GUP as default. Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag and adding appropriate documentation.

While at it, remove a stale comment from follow_trans_huge_pmd(): That comment for pmd_protnone() was added in commit 2b4847e73004 ("mm: numa: serialise parallel get_user_page against THP migration"), which noted:

  THP does not unmap pages due to a lack of support for migration
  entries at a PMD level. This allows races with get_user_pages

Nowadays, we do have PMD migration entries, so the comment no longer applies. Let's drop it.
[1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com [2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com Link: https://lkml.kernel.org/r/20230803143208.383663-2-david@redhat.com Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: liubo <liubo254@huawei.com> Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com Reported-by: Peter Xu <peterx@redhat.com> Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/ Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Peter Xu <peterx@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
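[Editor's note] The resulting check reads approximately like this (a sketch of gup_can_follow_protnone() as described; comments paraphrased):

    static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
                                               unsigned int flags)
    {
            /* callers not honoring NUMA faults may follow protnone */
            if (!(flags & FOLL_HONOR_NUMA_FAULT))
                    return true;

            /* in inaccessible (PROT_NONE) VMAs, protnone cannot mean a
             * NUMA hinting fault -- required so FOLL_FORCE keeps working */
            return !vma_is_accessible(vma);
    }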
2023-08-21  net/mlx5: Convert PCI error values to generic errnos  (Ilpo Järvinen)
mlx5_pci_link_toggle() returns a mix of PCI-specific error codes and generic errnos. Convert the PCI-specific error values to generic errno using pcibios_err_to_errno() before returning them. Fixes: eabe8e5e88f5 ("net/mlx5: Handle sync reset now event") Fixes: 212b4d7251c1 ("net/mlx5: Wait for firmware to enable CRS before pci_restore_state") Link: https://lore.kernel.org/r/20230814132721.26608-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> [bhelgaas: rebase to pci/pcie-rmw, also convert in mlx5_check_dev_ids()] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
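[Editor's note] The conversion pattern (a sketch; the touched call sites in the driver differ):

    err = pci_read_config_word(dev, PCI_DEVICE_ID, &dev_id);
    if (err)
            /* e.g. PCIBIOS_DEVICE_NOT_FOUND becomes -ENODEV */
            return pcibios_err_to_errno(err);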
2023-08-21  PCI: Document the Capability accessor RMW improvements  (Ilpo Järvinen)
Documentation claims port service drivers should play nice with respect to PCIe Capability changes, but the concurrency control is now provided in the Capability accessors as long as the correct ones are used. Update the documentation to match the RMW accessor behavior. Link: https://lore.kernel.org/r/20230717120503.15276-12-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-08-21  wifi: ath10k: Use RMW accessors for changing LNKCTL  (Ilpo Järvinen)
Don't assume that only the driver would be accessing LNKCTL. ASPM policy changes can trigger write to LNKCTL outside of driver's control. Use RMW capability accessors which do proper locking to avoid losing concurrent updates to the register value. On restore, clear the ASPMC field properly. Suggested-by: Lukas Wunner <lukas@wunner.de> Fixes: 76d870ed09ab ("ath10k: enable ASPM") Link: https://lore.kernel.org/r/20230717120503.15276-11-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Kalle Valo <kvalo@kernel.org>
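[Editor's note] The same pattern recurs in the ath12k, ath11k, mlx5, radeon, and amdgpu commits below; in essence (a sketch):

    /* before: open-coded RMW, racy against concurrent ASPM updates */
    pcie_capability_read_word(pdev, PCI_EXP_LNKCTL, &lnkctl);
    lnkctl &= ~PCI_EXP_LNKCTL_ASPMC;
    pcie_capability_write_word(pdev, PCI_EXP_LNKCTL, lnkctl);

    /* after: a single accessor that takes the RMW lock internally */
    pcie_capability_clear_and_set_word(pdev, PCI_EXP_LNKCTL,
                                       PCI_EXP_LNKCTL_ASPMC, 0);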
2023-08-21  wifi: ath12k: Use RMW accessors for changing LNKCTL  (Ilpo Järvinen)
Don't assume that only the driver would be accessing LNKCTL. ASPM policy changes can trigger write to LNKCTL outside of driver's control. Use RMW capability accessors which do proper locking to avoid losing concurrent updates to the register value. On restore, clear the ASPMC field properly. Suggested-by: Lukas Wunner <lukas@wunner.de> Fixes: d889913205cf ("wifi: ath12k: driver for Qualcomm Wi-Fi 7 devices") Link: https://lore.kernel.org/r/20230717120503.15276-10-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Kalle Valo <kvalo@kernel.org>
2023-08-21  wifi: ath11k: Use RMW accessors for changing LNKCTL  (Ilpo Järvinen)
Don't assume that only the driver would be accessing LNKCTL. ASPM policy changes can trigger write to LNKCTL outside of driver's control. Use RMW capability accessors which do proper locking to avoid losing concurrent updates to the register value. On restore, clear the ASPMC field properly. Suggested-by: Lukas Wunner <lukas@wunner.de> Fixes: e9603f4bdcc0 ("ath11k: pci: disable ASPM L0sLs before downloading firmware") Link: https://lore.kernel.org/r/20230717120503.15276-9-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Kalle Valo <kvalo@kernel.org>
2023-08-21  net/mlx5: Use RMW accessors for changing LNKCTL  (Ilpo Järvinen)
Don't assume that only the driver would be accessing LNKCTL of the upstream bridge. ASPM policy changes can trigger write to LNKCTL outside of driver's control. Use RMW capability accessors which do proper locking to avoid losing concurrent updates to the register value. Suggested-by: Lukas Wunner <lukas@wunner.de> Fixes: eabe8e5e88f5 ("net/mlx5: Handle sync reset now event") Link: https://lore.kernel.org/r/20230717120503.15276-8-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com>
2023-08-21  drm/radeon: Use RMW accessors for changing LNKCTL  (Ilpo Järvinen)
Don't assume that only the driver would be accessing LNKCTL. ASPM policy changes can trigger write to LNKCTL outside of driver's control. And in the case of upstream bridge, the driver does not even own the device it's changing the registers for. Use RMW capability accessors which do proper locking to avoid losing concurrent updates to the register value. Suggested-by: Lukas Wunner <lukas@wunner.de> Fixes: 8a7cd27679d0 ("drm/radeon/cik: add support for pcie gen1/2/3 switching") Fixes: b9d305dfb66c ("drm/radeon: implement pcie gen2/3 support for SI") Link: https://lore.kernel.org/r/20230717120503.15276-7-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-21  cpufreq: intel_pstate: set stale CPU frequency to minimum  (Doug Smythies)
The intel_pstate CPU frequency scaling driver does not use policy->cur and it is 0. When the CPU frequency is outdated arch_freq_get_on_cpu() will default to the nominal clock frequency when its call to cpufreq_quick_getpolicy_cur returns the never updated 0. Thus, the listed frequency might be outside of currently set limits. Some users are complaining about the high reported frequency, albeit stale, when their system is idle and/or it is above the reduced maximum they have set. This patch will maintain policy_cur for the intel_pstate driver at the current minimum CPU frequency. Reported-by: Yang Jie <yang.jie@linux.intel.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217597 Signed-off-by: Doug Smythies <dsmythies@telus.net> [ rjw: White space damage fixes and comment adjustment ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-08-21Revert "pds_core: Fix some kernel-doc comments"Jakub Kicinski
This reverts commit cb39c35783f26892bb1a72b1115c94fa2e77f4c5. The patch was applied too hastily; the problem is already fixed in Alex's vfio tree: https://lore.kernel.org/all/20230821112237.105872b5.alex.williamson@redhat.com/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-21  drm/amdgpu: Use RMW accessors for changing LNKCTL  (Ilpo Järvinen)
Don't assume that only the driver would be accessing LNKCTL. ASPM policy changes can trigger write to LNKCTL outside of driver's control. And in the case of upstream bridge, the driver does not even own the device it's changing the registers for. Use RMW capability accessors which do proper locking to avoid losing concurrent updates to the register value. Suggested-by: Lukas Wunner <lukas@wunner.de> Fixes: a2e73f56fa62 ("drm/amdgpu: Add support for CIK parts") Fixes: 62a37553414a ("drm/amdgpu: add si implementation v10") Link: https://lore.kernel.org/r/20230717120503.15276-6-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-21  cpufreq: stats: Improve the performance of cpufreq_stats_create_table()  (Liao Chang)
In the worst case, the freq_table of the policy data is not sorted and contains duplicate frequencies. This means that creating the freq_table of the stats data needs to iterate through the entire freq_table of the policy to ensure each frequency is unique, which has a time complexity of O(N^2), where N is the number of frequencies in the freq_table of the policy. However, if policy.freq_table is already sorted and contains no duplicate frequencies, the time complexity of creating stats.freq_table can be reduced to O(N); the 'freq_table_sorted' field of the policy data can be used to indicate whether policy.freq_table is sorted. Signed-off-by: Liao Chang <liaochang1@huawei.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Dhruva Gole <d-gole@ti.com> [ rjw: Fix typo in changelog, remove redundant parens ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
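[Editor's note] The fast path can be sketched as follows (freq_table_get_index() stands in for the file-local duplicate lookup; an assumption on naming):

    cpufreq_for_each_valid_entry(pos, policy->freq_table) {
            /* a sorted driver table has no duplicates: skip the O(N)
             * scan of stats->freq_table and insert directly */
            if (policy->freq_table_sorted != CPUFREQ_TABLE_UNSORTED ||
                freq_table_get_index(stats, pos->frequency) == -1)
                    stats->freq_table[i++] = pos->frequency;
    }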
2023-08-21  nsproxy: Convert nsproxy.count to refcount_t  (Elena Reshetova)
atomic_t variables are currently used to implement reference counters with the following properties:

 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further increments aren't allowed
 - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nsproxy.count is used as pure reference counter. Convert it to refcount_t and fix up the operations.

**Important note for maintainers: Some functions from refcount_t API defined in refcount.h have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage.

For the nsproxy.count it might make a difference in following places:

 - put_nsproxy() and switch_task_namespaces(): decrement in refcount_dec_and_test() only provides RELEASE ordering and ACQUIRE ordering on success vs. fully ordered atomic counterpart

Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230818041327.gonna.210-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
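[Editor's note] The conversion follows the usual shape (a sketch of the nsproxy helpers):

    struct nsproxy {
            refcount_t count;       /* was: atomic_t count */
            /* ... namespace pointers ... */
    };

    static inline void get_nsproxy(struct nsproxy *ns)
    {
            refcount_inc(&ns->count);               /* was: atomic_inc() */
    }

    static inline void put_nsproxy(struct nsproxy *ns)
    {
            if (refcount_dec_and_test(&ns->count))  /* was: atomic_dec_and_test() */
                    free_nsproxy(ns);
    }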
2023-08-21  ACPI: x86: s2idle: Add a function to get LPS0 constraint for a device  (Mario Limonciello)
Other parts of the kernel may use LPS0 constraints information to make decisions on what power state to put a device into. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> [ rjw: Rewrite kerneldoc, rearrange if () statement, edit subject and changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-08-21  ACPI: x86: s2idle: Add for_each_lpi_constraint() helper  (Andy Shevchenko)
We have one existing and one coming user of this macro. Introduce a helper. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-08-21  ACPI: x86: s2idle: Add more debugging for AMD constraints parsing  (Mario Limonciello)
While parsing the constraints show all the entries for the table to aid with debugging other problems later. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-08-21  ACPI: x86: s2idle: Fix a logic error parsing AMD constraints table  (Mario Limonciello)
The constraints table parsing should reset the `list` object after running through all of the `info_obj` iterations. Whitespace is also adjusted, as less code is now included with each loop. This fixes a functional problem where a badly formed package in the inner loop may have left incorrect data behind. Fixes: 146f1ed852a8 ("ACPI: PM: s2idle: Add AMD support to handle _DSM") Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-08-21  ACPI: x86: s2idle: Catch multiple ACPI_TYPE_PACKAGE objects  (Mario Limonciello)
If a badly constructed firmware includes multiple `ACPI_TYPE_PACKAGE` objects while evaluating the AMD LPS0 _DSM, there will be a memory leak. Explicitly guard against this. Suggested-by: Bjorn Helgaas <helgaas@kernel.org> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-08-21  ACPI: x86: s2idle: Post-increment variables when getting constraints  (Mario Limonciello)
When code uses a pre-increment it makes the reader question "why". In the constraint fetching code there is no reason for the variables to be pre-incremented so adjust to post-increment. No intended functional changes. Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Suggested-by: Bjorn Helgaas <helgaas@kernel.org> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-08-21  ACPI: Adjust #ifdef for *_lps0_dev use  (Mario Limonciello)
The `#ifdef` for acpi_register_lps0_dev() currently is guarded against `CONFIG_X86`, but actually the functions contained in the block are specifically sleep related functions. Adjust the guard to also check for `CONFIG_SUSPEND`. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-08-21  net/mlx5: Devcom, only use devcom after NULL check in mlx5_devcom_send_event()  (Li Zetao)
There is a warning reported by the kernel test robot:

  smatch warnings:
  drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c:264 mlx5_devcom_send_event() warn: variable dereferenced before IS_ERR check devcom (see line 259)

The reason for the warning is that the pointer is used before the check; move the assignment to comp after the devcom check to silence the warning. Fixes: 88d162b47981 ("net/mlx5: Devcom, Infrastructure changes") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/r/202308041028.AkXYDwJ6-lkp@intel.com/ Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
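[Editor's note] The fix is a simple reordering (a sketch; struct and field names per the driver):

    struct mlx5_devcom_comp *comp;

    if (IS_ERR_OR_NULL(devcom))
            return -ENODEV;
    /* only dereference devcom once it is known to be valid */
    comp = devcom->comp;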
2023-08-21  net/mlx5: DR, Supporting inline WQE when possible  (Itamar Gozlan)
In a WQE (Work Queue Entry), there are two types of data segments: pointers and inline data, where inline data is passed directly as part of the WQE. For software steering, the maximal inline size should be less than 2*MLX5_SEND_WQE_BB, i.e., the potential data must fit together with the required inline WQE headers. Two consecutive blocks (MLX5_SEND_WQE_BB) are not guaranteed to reside on the same memory page. Hence, writes to MLX5_SEND_WQE_BB should be done separately, i.e., each MLX5_SEND_WQE_BB should be obtained using the mlx5_wq_cyc_get_wqe macro. Signed-off-by: Itamar Gozlan <igozlan@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-21  net/mlx5: Rename devlink port ops struct for PFs/VFs  (Jiri Pirko)
As this struct is only used for devlink ports created for PF/VF, add it to the name of the variable to distinguish from the SF related ops struct. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-21  net/mlx5: Remove VPORT_UPLINK handling from devlink_port.c  (Jiri Pirko)
It is not possible that the functions in devlink_port.c are called for uplink port. Remove this leftover code. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-21  net/mlx5: Call mlx5_esw_offloads_rep_load/unload() for uplink port directly  (Jiri Pirko)
For the uplink port, mlx5_esw_offloads_load/unload_rep() are currently called. There are two checks inside, which effectively make the functions simple wrappers of mlx5_esw_offloads_rep_load/unload() for the uplink port. So avoid one check and the indirection, and call mlx5_esw_offloads_rep_load/unload() directly for the uplink port. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-21  net/mlx5: Update dead links in Kconfig documentation  (Rahul Rameshbabu)
Point to NVIDIA documentation for device specific information now that the Mellanox documentation site is deprecated. Refer to kernel documentation sources for generic information not specific to mlx5 devices. Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-21  net/mlx5: Remove health syndrome enum duplication  (Gal Pressman)
Health syndrome enum values were duplicated in mlx5_ifc and health.c, the correct place for them is mlx5_ifc. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-21  net/mlx5: DR, Remove unneeded local variable  (Yevgeny Kliteynik)
Remove local variable that is already defined outside of the scope of this block. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-21  net/mlx5: DR, Fix code indentation  (Yevgeny Kliteynik)
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-21  net/mlx5: IRQ, consolidate irq and affinity mask allocation  (Saeed Mahameed)
Consolidate the mlx5_irq and mlx5_irq->mask allocation to simplify error flows and to match the deallocation sequence in @irq_release for symmetry. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com>
2023-08-21  net/mlx5e: Fix spelling mistake "Faided" -> "Failed"  (Colin Ian King)
There is a spelling mistake in a warning message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-21  net/mlx5e: aRFS, Introduce ethtool stats  (Adham Faris)
Improve aRFS observability by adding a new set of counters. Each Rx ring will have the set of counters listed below, exposed through ethtool -S.

1) arfs_add: number of times a new rule has been created.
2) arfs_request_in: number of times a rule was requested to move from its current Rx ring to a new Rx ring (incremented on the destination Rx ring).
3) arfs_request_out: number of times a rule was requested to move out from its current Rx ring (incremented on source/current Rx ring).
4) arfs_expired: number of times a rule has been expired by the kernel and removed from HW.
5) arfs_err: number of times a rule creation or modification has failed.

This patch removes the rx[i]_xsk_arfs_err counter and its documentation in mlx5/counters.rst, since aRFS activity does not occur in XSK RQs. Signed-off-by: Adham Faris <afaris@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com>
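[Editor's note] The counters land in the per-ring stats structure, roughly as follows (counter names from the changelog; structure name and placement are assumptions):

    struct mlx5e_rq_stats {
            /* ... */
            u64 arfs_add;           /* rules created */
            u64 arfs_request_in;    /* rules moved to this ring */
            u64 arfs_request_out;   /* rules moved away from this ring */
            u64 arfs_expired;       /* rules expired and removed from HW */
            u64 arfs_err;           /* rule create/modify failures */
    };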
2023-08-21  net/mlx5e: aRFS, Warn if aRFS table does not exist for aRFS rule  (Adham Faris)
aRFS tables should be allocated and exist in advance. The driver shouldn't reach a point where it tries to add an aRFS rule to a table that does not exist. Add a warning if the driver encounters such a situation. Signed-off-by: Adham Faris <afaris@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>