path: root/include
2022-09-11  memcg: increase MEMCG_CHARGE_BATCH to 64  (Shakeel Butt)
For several years MEMCG_CHARGE_BATCH was kept at 32, but with bigger machines and network-intensive workloads requiring throughput in Gbps, 32 is too small and makes the memcg charging path a bottleneck. For now, increase it to 64 for easy acceptance into 6.0. We will need to revisit this in the future for the ever-increasing demand for higher performance. Please note that the memcg charge path drains the per-cpu memcg charge stock, so there should not be any OOM behavior change, though it does have an impact on rstat flushing and high-limit reclaim backoff.

To evaluate the impact of this optimization, on a 72-CPU machine we ran the following workload in a three-level cgroup hierarchy:

  $ netserver -6
  # 36 instances of netperf with following params
  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):

  Without (6.0-rc1)  10482.7 Mbps
  With patch         17064.7 Mbps (62.7% improvement)

With the patch, the throughput improved by 62.7%.

Link: https://lkml.kernel.org/r/20220825000506.239406-4-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
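The change itself is a one-liner; a sketch of what the constant looks like after the bump (the surrounding per-CPU stock machinery is unchanged):

    /* include/linux/memcontrol.h (sketch): each CPU caches up to this many
     * pages of charge in its local stock, so the shared page_counter atomics
     * are touched far less often under concurrent charging. */
    #define MEMCG_CHARGE_BATCH 64U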
2022-09-11  mm: page_counter: rearrange struct page_counter fields  (Shakeel Butt)
With memcg v2 enabled, memcg->memory.usage is a very hot member for workloads doing memcg charging on multiple CPUs concurrently, particularly network-intensive workloads. In addition, there is false cache-line sharing between memory.usage and memory.high on the charge path. This patch moves usage into a separate cacheline and moves all the read-mostly fields into another separate cacheline.

To evaluate the impact of this optimization, on a 72-CPU machine we ran the following workload in a three-level cgroup hierarchy:

  $ netserver -6
  # 36 instances of netperf with following params
  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):

  Without (6.0-rc1)  10482.7 Mbps
  With patch         12413.7 Mbps (18.4% improvement)

With the patch, the throughput improved by 18.4%.

One side effect of this patch is the increase in the size of struct mem_cgroup. For example, with this patch on a 64-bit build, the size of struct mem_cgroup increased from 4032 bytes to 4416 bytes. However, for the performance improvement, this additional size is worth it. In addition, there are opportunities to reduce the size of struct mem_cgroup, like deprecation of the kmem and tcpmem page counters and better packing.

Link: https://lkml.kernel.org/r/20220825000506.239406-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
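A minimal sketch of the layout idea, assuming the kernel's cacheline-alignment annotation; this is illustrative, not the exact upstream struct:

    /* Keep the hot, frequently-written usage counter on its own cacheline and
     * group the read-mostly limit fields on another, so charging CPUs do not
     * false-share with readers of high/max on the charge path. */
    struct page_counter_sketch {
        atomic_long_t usage;            /* written on every charge/uncharge */

        /* read-mostly fields start on a fresh cacheline */
        unsigned long min ____cacheline_internodealigned_in_smp;
        unsigned long low;
        unsigned long high;
        unsigned long max;
        struct page_counter_sketch *parent;
    };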
2022-09-11  mm: pagewalk: fix documentation of PTE hole handling  (Rolf Eike Beer)
Empty PTEs are passed to the pte_entry callback, not to pte_hole. Link: https://lkml.kernel.org/r/3695521.kQq0lBPeGt@devpool047 Signed-off-by: Rolf Eike Beer <eb@emlix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11  mm: fix use-after-free of page_ext after race with memory-offline  (Charan Teja Kalla)
Below is one path where a race between page_ext access and offlining of the respective memory block causes a use-after-free on the access of the page_ext structure.

Process 1 is reading /proc/page_owner while process 2 is offlining memory through offline_pages():

a) Process 1's PageBuddy check fails, so it proceeds to get the page_owner information through page_ext access:
     page_ext = lookup_page_ext(page);

b) Meanwhile, process 2 runs migrate_pages(). Since all pages are successfully migrated as part of the offline operation, it sends the MEM_OFFLINE notification, which for page_ext calls:
     offline_page_ext() --> __free_page_ext() --> free_page_ext() --> vfree(ms->page_ext)
     mem_section->page_ext = NULL

c) Process 1's check for the PAGE_EXT flags in the page_ext->flags access results in a use-after-free (leading to translation faults).

As mentioned above, there is really no synchronization between page_ext access and its freeing in memory offline. The memory offline steps (roughly) on a memory block are as below:

1) Isolate all the pages.
2) while(1) try to free the pages to buddy (->free_list[MIGRATE_ISOLATE]).
3) Delete the pages from this buddy list.
4) Then free page_ext. (Note: the struct page is still alive as it is freed only during hot remove of the memory, which frees the memmap; a step the user might not perform.)

This design leads to a state where struct page is alive but struct page_ext is freed, even though the latter is ideally part of the former and just represents the page flags (check [3] for why this design was chosen). The race above is just one example, but the problem persists in the other paths involving page_ext->flags access too (e.g. page_is_idle()).

Fix all the paths where offline races with page_ext access by maintaining synchronization with an RCU lock, achieved in 3 steps:

1) Invalidate all the page_ext's of the sections of a memory block by storing a flag in the LSB of mem_section->page_ext.
2) Wait until all the existing readers finish working with the ->page_ext's with synchronize_rcu(). Any parallel process that starts after this call will not get the page_ext, through lookup_page_ext(), for the block on which the offline operation is being performed.
3) Now safely free all the sections' ->page_ext's of the block on which the offline operation is being performed.

Note: if synchronize_rcu() takes time, optimizations can be made in this path through call_rcu() [2].

Thanks to David Hildenbrand for his views/suggestions on the initial discussion [1] and Pavan Kondeti for various inputs on this patch.

[1] https://lore.kernel.org/linux-mm/59edde13-4167-8550-86f0-11fc67882107@quicinc.com/
[2] https://lore.kernel.org/all/a26ce299-aed1-b8ad-711e-a49e82bdd180@quicinc.com/T/#u
[3] https://lore.kernel.org/all/6fa6b7aa-731e-891c-3efb-a03d6a700efa@redhat.com/

[quic_charante@quicinc.com: rename label `loop' to `ext_put_continue' per David]
Link: https://lkml.kernel.org/r/1661496993-11473-1-git-send-email-quic_charante@quicinc.com
Link: https://lkml.kernel.org/r/1660830600-9068-1-git-send-email-quic_charante@quicinc.com
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Fernand Sieber <sieberf@amazon.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
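A hedged sketch of the offline-side ordering described in the three steps above; the symbol names (PAGE_EXT_INVALID in particular) follow the description rather than the exact upstream code:

    static void offline_block_page_ext_sketch(struct mem_section *ms)
    {
        struct page_ext *base = ms->page_ext;

        /* 1) invalidate: set a flag in the pointer's LSB so that new
         *    lookup_page_ext() callers see the section's page_ext as gone */
        WRITE_ONCE(ms->page_ext,
                   (struct page_ext *)((unsigned long)base | PAGE_EXT_INVALID));

        /* 2) wait for every reader already inside its RCU read-side section
         *    (rcu_read_lock() around lookup_page_ext() plus the flags access) */
        synchronize_rcu();

        /* 3) only now is it safe to free the storage and clear the pointer */
        vfree(base);
        ms->page_ext = NULL;
    }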
2022-09-11  mm: kill find_min_pfn_with_active_regions()  (Kefeng Wang)
find_min_pfn_with_active_regions() is only called from free_area_init(). Open-code the PHYS_PFN(memblock_start_of_DRAM()) into free_area_init(), and kill find_min_pfn_with_active_regions(). Link: https://lkml.kernel.org/r/20220815111017.39341-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11  arch: mm: rename FORCE_MAX_ZONEORDER to ARCH_FORCE_MAX_ORDER  (Zi Yan)
This Kconfig option is used by individual arch to set its desired MAX_ORDER. Rename it to reflect its actual use. Link: https://lkml.kernel.org/r/20220815143959.1511278-1-zi.yan@sent.com Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: Guo Ren <guoren@kernel.org> [csky] Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Huacai Chen <chenhuacai@kernel.org> [LoongArch] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Taichi Sugaya <sugaya.taichi@socionext.com> Cc: Neil Armstrong <narmstrong@baylibre.com> Cc: Qin Jian <qinjian@cqplus1.com> Cc: Guo Ren <guoren@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11  memory tiering: adjust hot threshold automatically  (Huang Ying)
The promotion hot threshold is workload and system-configuration dependent, so this patch implements a method to adjust the hot threshold automatically. The basic idea is to control the number of candidate promotion pages to match the promotion rate limit. If the hint page fault latency of a page is less than the hot threshold, we will try to promote the page, and the page is called a candidate promotion page. If the number of candidate promotion pages in the statistics interval is much more than the promotion rate limit, the hot threshold will be decreased to reduce the number of candidate promotion pages. Otherwise, the hot threshold will be increased to increase the number of candidate promotion pages.

To make the above method work, in each statistics interval the total number of pages to check (on which the hint page faults occur) and the hot/cold distribution need to be stable. Because the page tables are scanned linearly in NUMA balancing, but the hot/cold distribution usually isn't uniform along the address space, the statistics interval should be larger than the NUMA balancing scan period. So in the patch the max scan period is used as the statistics interval, and it works well in our tests.

Link: https://lkml.kernel.org/r/20220713083954.34196-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: osalvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
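A conceptual sketch of the feedback loop described above; all names and the 10% adjustment band are illustrative, not the exact upstream logic:

    static unsigned int adjust_hot_threshold_sketch(unsigned int threshold_ms,
                                                    unsigned long nr_candidates,
                                                    unsigned long rate_limit_pages)
    {
        if (nr_candidates > rate_limit_pages + rate_limit_pages / 10)
            threshold_ms -= threshold_ms / 10;      /* too many candidates: tighten */
        else if (nr_candidates + rate_limit_pages / 10 < rate_limit_pages)
            threshold_ms += threshold_ms / 10;      /* too few candidates: relax */

        if (threshold_ms < 1)
            threshold_ms = 1;
        if (threshold_ms > 60 * 1000)
            threshold_ms = 60 * 1000;               /* stay within the scan period scale */
        return threshold_ms;
    }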
2022-09-11  memory tiering: rate limit NUMA migration throughput  (Huang Ying)
In NUMA balancing memory tiering mode, if there are hot pages in slow memory node and cold pages in fast memory node, we need to promote/demote hot/cold pages between the fast and cold memory nodes. A choice is to promote/demote as fast as possible. But the CPU cycles and memory bandwidth consumed by the high promoting/demoting throughput will hurt the latency of some workload because of accessing inflating and slow memory bandwidth contention. A way to resolve this issue is to restrict the max promoting/demoting throughput. It will take longer to finish the promoting/demoting. But the workload latency will be better. This is implemented in this patch as the page promotion rate limit mechanism. The number of the candidate pages to be promoted to the fast memory node via NUMA balancing is counted, if the count exceeds the limit specified by the users, the NUMA balancing promotion will be stopped until the next second. A new sysctl knob kernel.numa_balancing_promote_rate_limit_MBps is added for the users to specify the limit. Link: https://lkml.kernel.org/r/20220713083954.34196-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: osalvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11  memory tiering: hot page selection with hint page fault latency  (Huang Ying)
Patch series "memory tiering: hot page selection", v4.

To optimize page placement in a memory tiering system with NUMA balancing, the hot pages in the slow memory nodes need to be identified. Essentially, the original NUMA balancing implementation selects the most recently accessed (MRU) pages to promote. But this isn't a perfect algorithm to identify the hot pages, because pages with quite low access frequency may still be accessed eventually, given that the NUMA balancing page table scanning period can be quite long (e.g. 60 seconds). So in this patchset, we implement a new hot page identification algorithm based on the latency between NUMA balancing page table scanning and the hint page fault, which is a kind of most frequently accessed (MFU) algorithm.

In NUMA balancing memory tiering mode, if there are hot pages in the slow memory node and cold pages in the fast memory node, we need to promote/demote hot/cold pages between the fast and slow memory nodes. One choice is to promote/demote as fast as possible, but the CPU cycles and memory bandwidth consumed by the high promoting/demoting throughput will hurt the latency of some workloads because of access latency inflation and slow-memory bandwidth contention. A way to resolve this issue is to restrict the max promoting/demoting throughput: it will take longer to finish the promoting/demoting, but the workload latency will be better. This is implemented in this patchset as the page promotion rate limit mechanism.

The promotion hot threshold is workload and system-configuration dependent, so this patchset also implements a method to adjust the hot threshold automatically. The basic idea is to control the number of candidate promotion pages to match the promotion rate limit.

We used the pmbench memory-access benchmark to test the patchset on a 2-socket server system with DRAM and PMEM installed. The test results are as follows:

                     pmbench score   promote rate
                      (accesses/s)           MB/s
                     -------------   ------------
  base                 146887704.1          725.6
  hot selection        165695601.2          544.0
  rate limit           162814569.8          165.2
  auto adjustment      170495294.0          136.9

From the results above, with the hot page selection patch [1/3], the pmbench score increases about 12.8% and the promote rate (overhead) decreases about 25.0%, compared with the base kernel. With the rate limit patch [2/3], the pmbench score decreases about 1.7% and the promote rate decreases about 69.6%, compared with the hot page selection patch. With the threshold auto-adjustment patch [3/3], the pmbench score increases about 4.7% and the promote rate decreases about 17.1%, compared with the rate limit patch.

Baolin helped to test the patchset with MySQL on a machine which contains 1 DRAM node (30G) and 1 PMEM node (126G):

  sysbench /usr/share/sysbench/oltp_read_write.lua \
  ......
    --tables=200 \
    --table-size=1000000 \
    --report-interval=10 \
    --threads=16 \
    --time=120

The tps can be improved by about 5%.

This patch (of 3):

To optimize page placement in a memory tiering system with NUMA balancing, the hot pages in the slow memory node need to be identified. Essentially, the original NUMA balancing implementation selects the most recently accessed (MRU) pages to promote. But this isn't a perfect algorithm to identify the hot pages, because pages with quite low access frequency may still be accessed eventually, given that the NUMA balancing page table scanning period can be quite long (e.g. 60 seconds). The most frequently accessed (MFU) algorithm is better. So, in this patch we implemented a better hot page selection algorithm.
Which is based on NUMA balancing page table scanning and hint page fault as follows, - When the page tables of the processes are scanned to change PTE/PMD to be PROT_NONE, the current time is recorded in struct page as scan time. - When the page is accessed, hint page fault will occur. The scan time is gotten from the struct page. And The hint page fault latency is defined as hint page fault time - scan time The shorter the hint page fault latency of a page is, the higher the probability of their access frequency to be higher. So the hint page fault latency is a better estimation of the page hot/cold. It's hard to find some extra space in struct page to hold the scan time. Fortunately, we can reuse some bits used by the original NUMA balancing. NUMA balancing uses some bits in struct page to store the page accessing CPU and PID (referring to page_cpupid_xchg_last()). Which is used by the multi-stage node selection algorithm to avoid to migrate pages shared accessed by the NUMA nodes back and forth. But for pages in the slow memory node, even if they are shared accessed by multiple NUMA nodes, as long as the pages are hot, they need to be promoted to the fast memory node. So the accessing CPU and PID information are unnecessary for the slow memory pages. We can reuse these bits in struct page to record the scan time. For the fast memory pages, these bits are used as before. For the hot threshold, the default value is 1 second, which works well in our performance test. All pages with hint page fault latency < hot threshold will be considered hot. It's hard for users to determine the hot threshold. So we don't provide a kernel ABI to set it, just provide a debugfs interface for advanced users to experiment. We will continue to work on a hot threshold automatic adjustment mechanism. The downside of the above method is that the response time to the workload hot spot changing may be much longer. For example, - A previous cold memory area becomes hot - The hint page fault will be triggered. But the hint page fault latency isn't shorter than the hot threshold. So the pages will not be promoted. - When the memory area is scanned again, maybe after a scan period, the hint page fault latency measured will be shorter than the hot threshold and the pages will be promoted. To mitigate this, if there are enough free space in the fast memory node, the hot threshold will not be used, all pages will be promoted upon the hint page fault for fast response. Thanks Zhong Jiang reported and tested the fix for a bug when disabling memory tiering mode dynamically. Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Wei Xu <weixugc@google.com> Cc: osalvador <osalvador@suse.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
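In short, the hotness test described above boils down to comparing the scan-to-fault latency with the threshold; an illustrative sketch (names hypothetical):

    /* A page is a promotion candidate when the hint fault arrives soon after
     * the NUMA-balancing scan that made its PTE/PMD PROT_NONE. */
    static bool hint_fault_is_hot_sketch(unsigned long scan_time_ms,
                                         unsigned long fault_time_ms,
                                         unsigned long hot_threshold_ms)
    {
        return fault_time_ms - scan_time_ms < hot_threshold_ms;
    }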
2022-09-11  mm: add more BUILD_BUG_ONs to gfp_migratetype()  (Peter Collingbourne)
gfp_migratetype() also expects GFP_RECLAIMABLE and GFP_MOVABLE|GFP_RECLAIMABLE to be shiftable into MIGRATE_* enum values, so add some more BUILD_BUG_ONs to reflect this assumption. Link: https://linux-review.googlesource.com/id/Iae64e2182f75c3aca776a486b71a72571d66d83e Link: https://lkml.kernel.org/r/20220726230241.3770532-1-pcc@google.com Signed-off-by: Peter Collingbourne <pcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
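The assertions are of this shape (shown schematically; the exact set added by the patch may differ):

    static inline void check_gfp_migratetype_assumptions(void)
    {
        /* GFP mobility bits must shift directly into the MIGRATE_* enum values;
         * the patch adds the reclaimable and movable|reclaimable cases. */
        BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);
        BUILD_BUG_ON((___GFP_RECLAIMABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_RECLAIMABLE);
        /* ...and likewise for (___GFP_MOVABLE | ___GFP_RECLAIMABLE). */
    }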
2022-09-11  hugetlb_cgroup: remove unneeded return value  (Miaohe Lin)
The return value of set_hugetlb_cgroup and set_hugetlb_cgroup_rsvd are always ignored. Remove them to clean up the code. Link: https://lkml.kernel.org/r/20220729080106.12752-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11  kfence: add sysfs interface to disable kfence for selected slabs  (Imran Khan)
By default a kfence allocation can happen for any slab object whose size is up to PAGE_SIZE, as long as that allocation is the first allocation after expiration of the kfence sample interval. But in certain debugging scenarios we may be interested in debugging corruptions involving some specific slub objects like dentry or ext4_* etc. In such cases, limiting kfence to allocations involving only specific slub objects will increase the probability of catching the issue, since the kfence pool will not be consumed by other slab objects.

This patch introduces a sysfs interface '/sys/kernel/slab/<name>/skip_kfence' to disable kfence for specific slabs. Having the interface work in this way does not impact the current/default behavior of kfence and allows us to use kfence for specific slabs (when needed) as well. The decision to skip/use kfence is taken depending on whether kmem_cache.flags has the (newly introduced) SLAB_SKIP_KFENCE flag set or not.

Link: https://lkml.kernel.org/r/20220814195353.2540848-1-imran.f.khan@oracle.com
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
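A hedged sketch of how the new cache flag gates the KFENCE pool (illustrative; the real check sits in the kfence allocation path):

    static inline bool cache_skips_kfence_sketch(struct kmem_cache *s)
    {
        /* Caches marked via /sys/kernel/slab/<name>/skip_kfence never draw
         * from the kfence pool, leaving it for the caches being debugged. */
        return s->flags & SLAB_SKIP_KFENCE;
    }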
2022-09-11  mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process  (Feng Tang)
Muchun Song found that after the MPOL_PREFERRED_MANY policy was introduced in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes"), policy_nodemask_current()'s semantics for this new policy changed: it returns 'preferred' nodes instead of 'allowed' nodes.

With the changed semantics of policy_nodemask_current(), a task with MPOL_PREFERRED_MANY policy could fail to get its reservation even though it can fall back to other nodes (either defined by cpusets or all online nodes) for that reservation, failing mmap calls unnecessarily early.

The fix is to not consider MPOL_PREFERRED_MANY for reservations at all, because it, unlike MPOL_BIND, does not pose any actual hard constraint.

Michal suggested that policy_nodemask_current() is only used by hugetlb and could be moved to hugetlb code with a more explicit name to enforce the 'allowed' semantics, for which only the MPOL_BIND policy matters. apply_policy_zone() is made extern to be called in hugetlb code and its return value is changed to bool.

[1] https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/

Link: https://lkml.kernel.org/r/20220805005903.95563-1-feng.tang@intel.com
Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reported-by: Muchun Song <songmuchun@bytedance.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ben Widawsky <bwidawsk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
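A hedged sketch of the 'allowed' semantics the new hugetlb helper enforces (names illustrative): only MPOL_BIND contributes a hard constraint, while MPOL_PREFERRED_MANY merely expresses a preference and can fall back, so it is ignored for reservations:

    static nodemask_t *allowed_mems_for_reservation_sketch(struct mempolicy *pol,
                                                           gfp_t gfp_mask)
    {
        if (pol && pol->mode == MPOL_BIND &&
            apply_policy_zone(pol, gfp_zone(gfp_mask)))
            return &pol->nodes;

        return NULL;    /* no hard constraint: cpuset/online nodes are all allowed */
    }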
2022-09-11  mm/vmscan: define macros for refaults in struct lruvec  (Yang Yang)
The magic number 0 and 1 are used in several places in vmscan.c. Define macros for them to improve code readability. Link: https://lkml.kernel.org/r/20220808005644.1721066-1-yang.yang29@zte.com.cn Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11  userfaultfd: add /dev/userfaultfd for fine grained access control  (Axel Rasmussen)
Historically, it has been shown that intercepting kernel faults with userfaultfd (thereby forcing the kernel to wait for an arbitrary amount of time) can be exploited, or at least can make some kinds of exploits easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we changed things so, in order for kernel faults to be handled by userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl must be configured so that any unprivileged user can do it. In a typical implementation of a hypervisor with live migration (take QEMU/KVM as one such example), we do indeed need to be able to handle kernel faults. But, both options above are less than ideal: - Toggling the sysctl increases attack surface by allowing any unprivileged user to do it. - Granting the live migration process CAP_SYS_PTRACE gives it this ability, but *also* the ability to "observe and control the execution of another process [...], and examine and change [its] memory and registers" (from ptrace(2)). This isn't something we need or want to be able to do, so granting this permission violates the "principle of least privilege". This is all a long winded way to say: we want a more fine-grained way to grant access to userfaultfd, without granting other additional permissions at the same time. To achieve this, add a /dev/userfaultfd misc device. This device provides an alternative to the userfaultfd(2) syscall for the creation of new userfaultfds. The idea is, any userfaultfds created this way will be able to handle kernel faults, without the caller having any special capabilities. Access to this mechanism is instead restricted using e.g. standard filesystem permissions. [axelrasmussen@google.com: Handle misc_register() failure properly] Link: https://lkml.kernel.org/r/20220819205201.658693-3-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20220808175614.3885028-3-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Nadav Amit <namit@vmware.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
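A minimal userspace sketch of the new creation path, assuming the UAPI described above; the USERFAULTFD_IOC_NEW ioctl name is taken from this series and may not exist in older headers:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    static int open_userfaultfd_via_dev(void)
    {
        int dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
        if (dev < 0)
            return -1;

        /* An fd created this way may handle kernel faults; access is gated by
         * ordinary file permissions on /dev/userfaultfd, not CAP_SYS_PTRACE. */
        int uffd = ioctl(dev, USERFAULTFD_IOC_NEW, O_CLOEXEC | O_NONBLOCK);

        close(dev);
        return uffd;
    }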
2022-09-11  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse  (Zach O'Keefe)
This idea was introduced by David Rientjes[1]. Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a synchronous collapse of memory at their own expense. The benefits of this approach are: * CPU is charged to the process that wants to spend the cycles for the THP * Avoid unpredictable timing of khugepaged collapse Semantics This call is independent of the system-wide THP sysfs settings, but will fail for memory marked VM_NOHUGEPAGE. If the ranges provided span multiple VMAs, the semantics of the collapse over each VMA is independent from the others. This implies a hugepage cannot cross a VMA boundary. If collapse of a given hugepage-aligned/sized region fails, the operation may continue to attempt collapsing the remainder of memory specified. The memory ranges provided must be page-aligned, but are not required to be hugepage-aligned. If the memory ranges are not hugepage-aligned, the start/end of the range will be clamped to the first/last hugepage-aligned address covered by said range. The memory ranges must span at least one hugepage-sized region. All non-resident pages covered by the range will first be swapped/faulted-in, before being internally copied onto a freshly allocated hugepage. Unmapped pages will have their data directly initialized to 0 in the new hugepage. However, for every eligible hugepage aligned/sized region to-be collapsed, at least one page must currently be backed by memory (a PMD covering the address range must already exist). Allocation for the new hugepage may enter direct reclaim and/or compaction, regardless of VMA flags. When the system has multiple NUMA nodes, the hugepage will be allocated from the node providing the most native pages. This operation operates on the current state of the specified process and makes no persistent changes or guarantees on how pages will be mapped, constructed, or faulted in the future Return Value If all hugepage-sized/aligned regions covered by the provided range were either successfully collapsed, or were already PMD-mapped THPs, this operation will be deemed successful. On success, process_madvise(2) returns the number of bytes advised, and madvise(2) returns 0. Else, -1 is returned and errno is set to indicate the error for the most-recently attempted hugepage collapse. Note that many failures might have occurred, since the operation may continue to collapse in the event a single hugepage-sized/aligned region fails. ENOMEM Memory allocation failed or VMA not found EBUSY Memcg charging failed EAGAIN Required resource temporarily unavailable. Try again might succeed. EINVAL Other error: No PMD found, subpage doesn't have Present bit set, "Special" page no backed by struct page, VMA incorrectly sized, address not page-aligned, ... Most notable here is ENOMEM and EBUSY (new to madvise) which are intended to provide the caller with actionable feedback so they may take an appropriate fallback measure. Use Cases An immediate user of this new functionality are malloc() implementations that manage memory in hugepage-sized chunks, but sometimes subrelease memory back to the system in native-sized chunks via MADV_DONTNEED; zapping the pmd. Later, when the memory is hot, the implementation could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain hugepage coverage and dTLB performance. TCMalloc is such an implementation that could benefit from this[2]. Only privately-mapped anon memory is supported for now, but additional support for file, shmem, and HugeTLB high-granularity mappings[2] is expected. 
File and tmpfs/shmem support would permit: * Backing executable text by THPs. Current support provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which might impair services from serving at their full rated load after (re)starting. Tricks like mremap(2)'ing text onto anonymous memory to immediately realize iTLB performance prevents page sharing and demand paging, both of which increase steady state memory footprint. With MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance and lower RAM footprints. * Backing guest memory by hugapages after the memory contents have been migrated in native-page-sized chunks to a new host, in a userfaultfd-based live-migration stack. [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/ [2] https://github.com/google/tcmalloc/tree/master/tcmalloc [jrdr.linux@gmail.com: avoid possible memory leak in failure path] Link: https://lkml.kernel.org/r/20220713024109.62810-1-jrdr.linux@gmail.com [zokeefe@google.com add missing kfree() to madvise_collapse()] Link: https://lore.kernel.org/linux-mm/20220713024109.62810-1-jrdr.linux@gmail.com/ Link: https://lkml.kernel.org/r/20220713161851.1879439-1-zokeefe@google.com [zokeefe@google.com: delay computation of hpage boundaries until use]] Link: https://lkml.kernel.org/r/20220720140603.1958773-4-zokeefe@google.com Link: https://lkml.kernel.org/r/20220706235936.2197195-10-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Signed-off-by: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Suggested-by: David Rientjes <rientjes@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
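A hedged userspace usage sketch; MADV_COLLAPSE's numeric value is taken from this series and may not be present in older uapi headers:

    #include <sys/mman.h>
    #include <stdio.h>

    #ifndef MADV_COLLAPSE
    #define MADV_COLLAPSE 25    /* value from this series; check your headers */
    #endif

    /* Ask for a synchronous collapse of a page-aligned anonymous range back
     * into THPs; on failure errno reflects the most recently failed region. */
    static int collapse_range(void *addr, size_t len)
    {
        if (madvise(addr, len, MADV_COLLAPSE) != 0) {
            perror("madvise(MADV_COLLAPSE)");   /* e.g. ENOMEM, EBUSY, EAGAIN, EINVAL */
            return -1;
        }
        return 0;
    }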
2022-09-11  mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage  (Zach O'Keefe)
When scanning an anon pmd to see if it's eligible for collapse, return SCAN_PMD_MAPPED if the pmd already maps a hugepage. Note that SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the file-collapse path, since the latter might identify pte-mapped compound pages. This is required by MADV_COLLAPSE which necessarily needs to know what hugepage-aligned/sized regions are already pmd-mapped. In order to determine if a pmd already maps a hugepage, refactor mm_find_pmd(): Return mm_find_pmd() to it's pre-commit f72e7dcdd252 ("mm: let mm_find_pmd fix buggy race with THP fault") behavior. ksm was the only caller that explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic there (pmd_present() and pmd_trans_huge() checks). Undo revert change in commit f72e7dcdd252 ("mm: let mm_find_pmd fix buggy race with THP fault") that open-coded split_huge_pmd_address() pmd lookup and use mm_find_pmd() instead. Link: https://lkml.kernel.org/r/20220706235936.2197195-9-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11  mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()  (Zach O'Keefe)
MADV_COLLAPSE is not coupled to the kernel-oriented sysfs THP settings[1]. hugepage_vma_check() is the authority on determining if a VMA is eligible for THP allocation/collapse, and currently enforces the sysfs THP settings. Add a flag to disable these checks. For now, only apply this arg to anon and file, which use /sys/kernel/transparent_hugepage/enabled. We can expand this to shmem, which uses /sys/kernel/transparent_hugepage/shmem_enabled, later. Use this flag in collapse_pte_mapped_thp() where previously the VMA flags passed to hugepage_vma_check() were OR'd with VM_HUGEPAGE to elide the VM_HUGEPAGE check in "madvise" THP mode. Prior to "mm: khugepaged: check THP flag in hugepage_vma_check()", this check also didn't check "never" THP mode. As such, this restores the previous behavior of collapse_pte_mapped_thp() where sysfs THP settings are ignored. See comment in code for justification why this is OK. [1] https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20220706235936.2197195-8-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11  fscrypt: change fscrypt_dio_supported() to prepare for STATX_DIOALIGN  (Eric Biggers)
To prepare for STATX_DIOALIGN support, make two changes to fscrypt_dio_supported(). First, remove the filesystem-block-alignment check and make the filesystems handle it instead. It previously made sense to have it in fs/crypto/; however, to support STATX_DIOALIGN the alignment restriction would have to be returned to filesystems. It ends up being simpler if filesystems handle this part themselves, especially for f2fs which only allows fs-block-aligned DIO in the first place. Second, make fscrypt_dio_supported() work on inodes whose encryption key hasn't been set up yet, by making it set up the key if needed. This is required for statx(), since statx() doesn't require a file descriptor. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220827065851.135710-4-ebiggers@kernel.org
2022-09-11  vfs: support STATX_DIOALIGN on block devices  (Eric Biggers)
Add support for STATX_DIOALIGN to block devices, so that direct I/O alignment restrictions are exposed to userspace in a generic way. Note that this breaks the tradition of stat operating only on the block device node, not the block device itself. However, it was felt that doing this is preferable, in order to make the interface useful and avoid needing separate interfaces for regular files and block devices. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/r/20220827065851.135710-3-ebiggers@kernel.org
2022-09-11  statx: add direct I/O alignment information  (Eric Biggers)
Traditionally, the conditions for when DIO (direct I/O) is supported were fairly simple. For both block devices and regular files, DIO had to be aligned to the logical block size of the block device. However, due to filesystem features that have been added over time (e.g. multi-device support, data journalling, inline data, encryption, verity, compression, checkpoint disabling, log-structured mode), the conditions for when DIO is allowed on a regular file have gotten increasingly complex. Whether a particular regular file supports DIO, and with what alignment, can depend on various file attributes and filesystem mount options, as well as which block device(s) the file's data is located on. Moreover, the general rule of DIO needing to be aligned to the block device's logical block size was recently relaxed to allow user buffers (but not file offsets) aligned to the DMA alignment instead. See commit bf8d08532bc1 ("iomap: add support for dma aligned direct-io"). XFS has an ioctl XFS_IOC_DIOINFO that exposes DIO alignment information. Uplifting this to the VFS is one possibility. However, as discussed (https://lore.kernel.org/linux-fsdevel/20220120071215.123274-1-ebiggers@kernel.org/T/#u), this ioctl is rarely used and not known to be used outside of XFS-specific code. It was also never intended to indicate when a file doesn't support DIO at all, nor was it intended for block devices. Therefore, let's expose this information via statx(). Add the STATX_DIOALIGN flag and two new statx fields associated with it: * stx_dio_mem_align: the alignment (in bytes) required for user memory buffers for DIO, or 0 if DIO is not supported on the file. * stx_dio_offset_align: the alignment (in bytes) required for file offsets and I/O segment lengths for DIO, or 0 if DIO is not supported on the file. This will only be nonzero if stx_dio_mem_align is nonzero, and vice versa. Note that as with other statx() extensions, if STATX_DIOALIGN isn't set in the returned statx struct, then these new fields won't be filled in. This will happen if the file is neither a regular file nor a block device, or if the file is a regular file and the filesystem doesn't support STATX_DIOALIGN. It might also happen if the caller didn't include STATX_DIOALIGN in the request mask, since statx() isn't required to return unrequested information. This commit only adds the VFS-level plumbing for STATX_DIOALIGN. For regular files, individual filesystems will still need to add code to support it. For block devices, a separate commit will wire it up too. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220827065851.135710-2-ebiggers@kernel.org
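A hedged sketch of querying the new fields from userspace, assuming glibc and kernel headers new enough to define STATX_DIOALIGN and the two stx_dio_* fields:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>

    static void print_dio_alignment(const char *path)
    {
        struct statx stx;

        if (statx(AT_FDCWD, path, 0, STATX_DIOALIGN, &stx) != 0)
            return;

        if (!(stx.stx_mask & STATX_DIOALIGN)) {
            /* not a regular file/bdev, or the filesystem doesn't report it */
            printf("%s: DIO alignment not reported\n", path);
        } else if (stx.stx_dio_mem_align == 0) {
            printf("%s: direct I/O not supported\n", path);
        } else {
            printf("%s: buffer align %u, offset align %u\n", path,
                   stx.stx_dio_mem_align, stx.stx_dio_offset_align);
        }
    }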
2022-09-11  mm/memory-failure: fix detection of memory_failure() handlers  (Dan Williams)
Some pagemap types, like MEMORY_DEVICE_GENERIC (device-dax), do not even have pagemap ops, which results in crash signatures like this:

  BUG: kernel NULL pointer dereference, address: 0000000000000010
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 8000000205073067 P4D 8000000205073067 PUD 2062b3067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 22 PID: 4535 Comm: device-dax Tainted: G OE N 6.0.0-rc2+ #59
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:memory_failure+0x667/0xba0
  [..]
  Call Trace:
   <TASK>
   ? _printk+0x58/0x73
   do_madvise.part.0.cold+0xaf/0xc5

Check for ops before checking if the ops have a memory_failure() handler.

Link: https://lkml.kernel.org/r/166153428781.2758201.1990616683438224741.stgit@dwillia2-xfh.jf.intel.com
Fixes: 33a8f7f2b3a3 ("pagemap,pmem: introduce ->memory_failure()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
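The fix described above amounts to an extra NULL check; a schematic of the guard (helper name illustrative):

    /* Bail out before dereferencing ops->memory_failure when the pagemap
     * (e.g. MEMORY_DEVICE_GENERIC / device-dax) registered no ops at all. */
    static bool pgmap_has_memory_failure_sketch(struct dev_pagemap *pgmap)
    {
        return pgmap->ops && pgmap->ops->memory_failure;
    }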
2022-09-11  Merge tag 'drm-misc-next-2022-09-09' of git://anongit.freedesktop.org/drm/drm-misc into drm-next  (Dave Airlie)
drm-misc-next for v6.1-rc1: [airlied - fix sun4i_tv build]

UAPI Changes:
- Hide unregistered connectors from GETCONNECTOR ioctl.
- drm/virtio no longer advertises LINEAR modifier, as it doesn't work.

Cross-subsystem Changes:
- Fix GPF in udmabuf failure path.

Core Changes:
- Rework TTM placement to use intersect/compatible functions.
- Drop legacy DP-MST support.
- More DP-MST related fixes, and move all state into atomic.
- Make DRM_MIPI_DBI select DRM_KMS_HELPER.
- Add audio_infoframe packing for DP.
- Add logging when some atomic check functions fail.
- Assorted documentation updates and fixes.

Driver Changes:
- Assorted cleanups and fixes in msm, lcdif, nouveau, virtio, panel/ilitek, bridge/icn6211, tve200, gma500, bridge/*, panfrost, via, bochs, qxl, sun4i.
- Add AUO B133UAN02.1, IVO M133NW4J-R3, Innolux N120ACA-EA1 eDP panels.
- Improve DP-MST modeset state handling in amdgpu, nouveau, i915.
- Drop DP-MST from radeon driver, it was broken and only user of legacy DP-MST.
- Handle unplugging better in vc4.
- Simplify drm cmdparser tests.
- Add DP support to ti-sn65dsi86.
- Add MT8195 DP support to mediatek.
- Support RGB565, XRGB64, and ARGB64 formats in vkms.
- Convert sun4i tv support to atomic.
- Refactor vc4/vec TV Modesetting, and fix timings.
- Use atomic helpers instead of simple display helpers in ssd130x.

Maintainer changes:
- Add Douglas Anderson as reviewer for panel-edp.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/a489485b-3ebc-c734-0f80-aed963d89efe@linux.intel.com
2022-09-11  Merge tag 'iommu-fixes-v6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu  (Linus Torvalds)
Pull iommu fixes from Joerg Roedel:

 - Intel VT-d fixes from Lu Baolu:
     - Boot kdump kernels with VT-d scalable mode on
     - Calculate the right page table levels
     - Fix two recursive locking issues
     - Fix a lockdep splat issue

 - AMD IOMMU fixes:
     - Fix for completion-wait command to use full 64 bits of data
     - Fix PASID related issue where GPU sound devices failed to initialize

 - Fix for Virtio-IOMMU to report correct caching behavior, needed for use with VFIO

* tag 'iommu-fixes-v6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu: Fix false ownership failure on AMD systems with PASID activated
  iommu/vt-d: Fix possible recursive locking in intel_iommu_init()
  iommu/virtio: Fix interaction with VFIO
  iommu/vt-d: Fix lockdep splat due to klist iteration in atomic context
  iommu/vt-d: Fix recursive lock issue in iommu_flush_dev_iotlb()
  iommu/vt-d: Correctly calculate sagaw value of IOMMU
  iommu/vt-d: Fix kdump kernels boot failure with scalable mode
  iommu/amd: use full 64-bit value in build_completion_wait()
2022-09-11  power: supply: Explain maintenance charging  (Linus Walleij)
In order for everyone to understand clearly why we want to use maintenance charging for batteries, expand the description with two diagrams and some text. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Matti Vaittinen <matti.vaittinen@fi.rohmeurope.com> Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2022-09-11  iommu/vt-d: Fix possible recursive locking in intel_iommu_init()  (Lu Baolu)
The global rwsem dmar_global_lock was introduced by commit 3a5670e8ac932 ("iommu/vt-d: Introduce a rwsem to protect global data structures"). It is used to protect DMAR related global data from DMAR hotplug operations. The dmar_global_lock used in the intel_iommu_init() might cause recursive locking issue, for example, intel_iommu_get_resv_regions() is taking the dmar_global_lock from within a section where intel_iommu_init() already holds it via probe_acpi_namespace_devices(). Using dmar_global_lock in intel_iommu_init() could be relaxed since it is unlikely that any IO board must be hot added before the IOMMU subsystem is initialized. This eliminates the possible recursive locking issue by moving down DMAR hotplug support after the IOMMU is initialized and removing the uses of dmar_global_lock in intel_iommu_init(). Fixes: d5692d4af08cd ("iommu/vt-d: Fix suspicious RCU usage in probe_acpi_namespace_devices()") Reported-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/894db0ccae854b35c73814485569b634237b5538.1657034828.git.robin.murphy@arm.com Link: https://lore.kernel.org/r/20220718235325.3952426-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-09-10  bpf: Add verifier support for custom callback return range  (Dave Marchevsky)
Verifier logic to confirm that a callback function returns 0 or 1 was added in commit 69c087ba6225b ("bpf: Add bpf_for_each_map_elem() helper"). At the time, callback return value was only used to continue or stop iteration. In order to support callbacks with a broader return value range, such as those added in rbtree series[0] and others, add a callback_ret_range to bpf_func_state. Verifier's helpers which set in_callback_fn will also set the new field, which the verifier will later use to check return value bounds. Default to tnum_range(0, 0) instead of using tnum_unknown as a sentinel value as the latter would prevent the valid range (0, U64_MAX) being used. Previous global default tnum_range(0, 1) is explicitly set for extant callback helpers. The change to global default was made after discussion around this patch in rbtree series [1], goal here is to make it more obvious that callback_ret_range should be explicitly set. [0]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com/ [1]: lore.kernel.org/bpf/20220830172759.4069786-2-davemarchevsky@fb.com/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220908230716.2751723-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10  bpf: Add support for writing to nf_conn:mark  (Daniel Xu)
Support direct writes to nf_conn:mark from TC and XDP prog types. This is useful when applications want to store per-connection metadata. This is also particularly useful for applications that run both bpf and iptables/nftables because the latter can trivially access this metadata. One example use case would be if a bpf prog is responsible for advanced packet classification and iptables/nftables is later used for routing due to pre-existing/legacy code. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/ebca06dea366e3e7e861c12f375a548cc4c61108.1662568410.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10  bpf: Add stub for btf_struct_access()  (Daniel Xu)
Add corresponding unimplemented stub for when CONFIG_BPF_SYSCALL=n Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/4021398e884433b1fef57a4d28361bb9fcf1bd05.1662568410.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10  ACPI: resource: Add helper function acpi_dev_get_memory_resources()  (Heikki Krogerus)
Wrapper function that finds all memory type resources by using acpi_dev_get_resources(). It removes the need for the drivers to check the resource data type separately. Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-09-10  Merge tag 'dma-mapping-6.0-2022-09-10' of git://git.infradead.org/users/hch/dma-mapping  (Linus Torvalds)
Pull dma-mapping fixes from Christoph Hellwig:

 - revert a panic on swiotlb initialization failure (Yu Zhao)
 - fix the lookup for partial syncs in dma-debug (Robin Murphy)
 - fix a shift overflow in swiotlb (Chao Gao)
 - fix a comment typo in swiotlb (Chao Gao)
 - mark a function static now that all abusers are gone (Christoph Hellwig)

* tag 'dma-mapping-6.0-2022-09-10' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: mark dma_supported static
  swiotlb: fix a typo
  swiotlb: avoid potential left shift overflow
  dma-debug: improve search for partial syncs
  Revert "swiotlb: panic if nslabs is too small"
2022-09-10  spi: Merge tag 'v6.0-rc4' into spi-6.1  (Mark Brown)
Linux 6.0-rc4 so we can test on BeagleBone again.
2022-09-10  asm-generic: Remove empty #ifdef SA_RESTORER  (Christophe Leroy)
There was a #ifdef SA_RESTORER to guard the sa_restorer field in struct sigaction. Commit 8a1ab3155c2a ("UAPI: (Scripted) Disintegrate include/asm-generic") moved that struct into uapi/asm-generic/signal.h but the #ifdef SA_RESTORER remained. Remove it. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-09-09  Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi  (Linus Torvalds)
Pull SCSI fixes from James Bottomley:
 "Eight patches which looks like quite a large core change, but most of the diffstat is reverting the attempt to rejig reference counting introduced in the last merge window which caused issues with device and module removal.

  Of the remaining four patches, only the fix use-after-free is substantial"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: mpt3sas: Fix use-after-free warning
  scsi: core: Fix a use-after-free
  scsi: core: Revert "Make sure that targets outlive devices"
  scsi: core: Revert "Make sure that hosts outlive targets"
  scsi: core: Revert "Simplify LLD module reference counting"
  scsi: core: Revert "Call blk_mq_free_tag_set() earlier"
  scsi: lpfc: Add missing destroy_workqueue() in error path
  scsi: lpfc: Return DID_TRANSPORT_DISRUPTED instead of DID_REQUEUE
2022-09-09  Bluetooth: Fix HCIGETDEVINFO regression  (Luiz Augusto von Dentz)
Recent changes break HCIGETDEVINFO since they change the size of hci_dev_info.

Fixes: 26afbd826ee3 ("Bluetooth: Add initial implementation of CIS connections")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-09-09  Merge tag 'driver-core-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core  (Linus Torvalds)
Pull driver core fixes from Greg KH:
 "Here are some small driver core and debugfs fixes for 6.0-rc5. Included in here are:

   - multiple attempts to get the arch_topology code to work properly on non-cluster SMT systems. First attempt caused build breakages in linux-next and 0-day, second try worked.

   - debugfs fixes for a long-suffering memory leak. The pattern of debugfs_remove(debugfs_lookup(...)) turns out to leak dentries, so add debugfs_lookup_and_remove() to fix this problem. Also fix up the scheduler debug code that highlighted this problem. Fixes for other subsystems will be trickling in over the next few months for this same issue once the debugfs function is merged.

  All of these have been in linux-next since Wednesday with no reported problems"

* tag 'driver-core-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  arch_topology: Make cluster topology span at least SMT CPUs
  sched/debug: fix dentry leak in update_sched_domain_debugfs
  debugfs: add debugfs_lookup_and_remove()
  driver core: fix driver_set_override() issue with empty strings
  Revert "arch_topology: Make cluster topology span at least SMT CPUs"
  arch_topology: Make cluster topology span at least SMT CPUs
2022-09-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma fixes from Jason Gunthorpe: "Many bug fixes in several drivers: - Fix misuse of the DMA API in rtrs - Several irdma issues: hung task due to SQ flushing, incorrect capability reporting to userspace, improper error handling for MW corners, touching an uninitialized SGL during invalidation. - hns was using the wrong page size limits for the HW, had an incorrect calculation of wqe_shift that caused WQE corruption, and miscomputed a timer id. - Fix a crash in SRP triggered by blktests - Fix compiler errors by calling virt_to_page() with the proper type in siw - Userspace-triggerable deadlock in ODP - mlx5 could use the wrong profile due to some driver loading races, counters were not working in some device configurations, and there was a crash on error unwind" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: RDMA/irdma: Report RNR NAK generation in device caps RDMA/irdma: Use s/g array in post send only when its valid RDMA/irdma: Return correct WC error for bind operation failure RDMA/irdma: Return error on MR deregister CQP failure RDMA/irdma: Report the correct max cqes from query device MAINTAINERS: Update maintainers of HiSilicon RoCE RDMA/mlx5: Fix UMR cleanup on error flow of driver init RDMA/mlx5: Set local port to one when accessing counters RDMA/mlx5: Rely on RoCE fw cap instead of devlink when setting profile IB/core: Fix a nested dead lock as part of ODP flow RDMA/siw: Pass a pointer to virt_to_page() RDMA/srp: Set scmnd->result only when scmnd is not NULL RDMA/hns: Remove the num_qpc_timer variable RDMA/hns: Fix wrong fixed value of qp->rq.wqe_shift RDMA/hns: Fix supported page size RDMA/cma: Fix arguments order in net device validation RDMA/irdma: Fix drain SQ hang with no completion RDMA/rtrs-srv: Pass the correct number of entries for dma mapped SGL RDMA/rtrs-clt: Use the right sg_cnt after ib_dma_map_sg
2022-09-09Merge tag 'drm-fixes-2022-09-10' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull drm fixes from Dave Airlie: "From a train in the Irish countryside, regular drm fixes for 6.0-rc5. This is mostly amdgpu/amdkfd and i915 fixes, then one panfrost, one ttm and one edid fix. Nothing too major going on. Hopefully a quiet week next week for LPC. edid: - Fix EDID 1.4 range-descriptor parsing ttm: - Fix ghost-object bulk moves i915: - Fix MIPI sequence block copy from BIOS' table - Fix PCODE min freq setup when GuC's SLPC is in use - Implement Workaround for eDP - Fix has_flat_ccs selection for DG1 amdgpu: - Firmware header fix - SMU 13.x fix - Debugfs memory leak fix - NBIO 7.7 fix - Firmware memory leak fix amdkfd: - Debug output fix panfrost: - Fix devfreq OPP" * tag 'drm-fixes-2022-09-10' of git://anongit.freedesktop.org/drm/drm: drm/panfrost: devfreq: set opp to the recommended one to configure regulator drm/ttm: cleanup the resource of ghost objects after locking them drm/amdgpu: prevent toc firmware memory leak drm/amdgpu: correct doorbell range/size value for CSDMA_DOORBELL_RANGE drm/amdkfd: print address in hex format rather than decimal drm/amd/display: fix memory leak when using debugfs_lookup() drm/amd/pm: add missing SetMGpuFanBoostLimitRpm mapping for SMU 13.0.7 drm/amd/amdgpu: add rlc_firmware_header_v2_4 to amdgpu_firmware_header drm/i915: consider HAS_FLAT_CCS() in needs_ccs_pages drm/i915: Implement WaEdpLinkRateDataReload drm/i915/slpc: Let's fix the PCODE min freq table setup for SLPC drm/i915/bios: Copy the whole MIPI sequence block drm/ttm: update bulk move object of ghost BO drm/edid: Handle EDID 1.4 range descriptor h/vfreq offsets
2022-09-09Merge tag 'linux-kselftest-kunit-fixes-6.0-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull KUnit fixes from Shuah Khan: "Two fixes to the test build and a fix for incorrect taint reason reporting" * tag 'linux-kselftest-kunit-fixes-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: tools: Add new "test" taint to kernel-chktaint kunit: fix Kconfig for build-in tests USB4 and Nitro Enclaves kunit: fix assert_type for comparison macros
2022-09-09ASoC: SOF: ipc4: Add macro to get core ID from log buffer status messagePeter Ujfalusi
The LOG_BUFFER_STATUS message includes the ID of the core which updated its log buffer. With IPC4 each core logs to a different slot in the debug window. Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Rander Wang <rander.wang@intel.com> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Link: https://lore.kernel.org/r/20220909114332.31393-3-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
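A hedged sketch of what such a helper macro can look like (the bit positions below are placeholders, not the real IPC4 message layout):

  #include <linux/bits.h>
  #include <linux/bitfield.h>

  /* Hypothetical field position - the actual IPC4 layout may differ. */
  #define SOF_IPC4_LOG_CORE_MASK		GENMASK(7, 4)
  #define SOF_IPC4_LOG_CORE_GET(primary)	FIELD_GET(SOF_IPC4_LOG_CORE_MASK, primary)

  /* usage: core = SOF_IPC4_LOG_CORE_GET(msg->primary); then pick that core's debug-window slot */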
2022-09-09dt-bindings: power: add power-domain header for rk3588Finley Xiao
Add all the power domains listed in the RK3588 TRM. Signed-off-by: Finley Xiao <finley.xiao@rock-chips.com> Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20220906143825.199089-3-sebastian.reichel@collabora.com Signed-off-by: Heiko Stuebner <heiko@sntech.de>
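Such headers follow the usual dt-binding pattern of one small integer constant per power domain, shared between the DT sources and the driver; a hedged sketch (domain names and numbers are placeholders, not the values from the RK3588 TRM):

  /* include/dt-bindings/power/rk3588-power.h - illustrative excerpt only */
  #ifndef __DT_BINDINGS_POWER_RK3588_POWER_H__
  #define __DT_BINDINGS_POWER_RK3588_POWER_H__

  #define RK3588_PD_GPU		0
  #define RK3588_PD_NPU		1
  #define RK3588_PD_VCODEC	2
  /* ... one define per domain listed in the TRM ... */

  #endif

A consumer node then references a domain as power-domains = <&power RK3588_PD_GPU>;, and the power-domain driver uses the same constants to index its domain table.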
2022-09-09ACPI: s2idle: Add a new ->check() callback for platform_s2idle_opsMario Limonciello
On some platforms Linux enters s2idle more aggressively than Windows enters Modern Standby, and this uncovers synchronization issues on the platform. To aid in debugging this class of problems in the future, add support for an extra optional callback intended for drivers to emit extra debugging. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220829162953.5947-2-mario.limonciello@amd.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
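A hedged sketch of how a platform driver might hook the new callback (the member list is abbreviated and a void(void) signature is assumed; s2idle_set_ops() is the existing registration path):

  #include <linux/suspend.h>
  #include <linux/printk.h>

  static void my_s2idle_check(void)
  {
  	/* Emit extra debugging: pending wake sources, firmware state, etc. */
  	pr_debug("s2idle check: dumping platform state for diagnostics\n");
  }

  static const struct platform_s2idle_ops my_s2idle_ops = {
  	/* .begin, .prepare, .wake, .restore, ... elided for brevity */
  	.check	= my_s2idle_check,	/* new optional callback */
  };

  /* in platform driver init: s2idle_set_ops(&my_s2idle_ops); */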
2022-09-10Merge tag 'drm-misc-fixes-2022-09-08' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes Short summary of fixes pull: * edid: Fix EDID 1.4 range-descriptor parsing * panfrost: Fix devfreq OPP * ttm: Fix ghost-object bulk moves Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/YxniKN4rK4qPp+J9@linux-uq9g
2022-09-09drivers/base: Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTESPhil Auld
As PAGE_SIZE is unsigned long, -1 > PAGE_SIZE when NR_CPUS <= 3. This leads to very large file sizes: topology$ ls -l total 0 -r--r--r-- 1 root root 18446744073709551615 Sep 5 11:59 core_cpus -r--r--r-- 1 root root 4096 Sep 5 11:59 core_cpus_list -r--r--r-- 1 root root 4096 Sep 5 10:58 core_id -r--r--r-- 1 root root 18446744073709551615 Sep 5 10:10 core_siblings -r--r--r-- 1 root root 4096 Sep 5 11:59 core_siblings_list -r--r--r-- 1 root root 18446744073709551615 Sep 5 11:59 die_cpus -r--r--r-- 1 root root 4096 Sep 5 11:59 die_cpus_list -r--r--r-- 1 root root 4096 Sep 5 11:59 die_id -r--r--r-- 1 root root 18446744073709551615 Sep 5 11:59 package_cpus -r--r--r-- 1 root root 4096 Sep 5 11:59 package_cpus_list -r--r--r-- 1 root root 4096 Sep 5 10:58 physical_package_id -r--r--r-- 1 root root 18446744073709551615 Sep 5 10:10 thread_siblings -r--r--r-- 1 root root 4096 Sep 5 11:59 thread_siblings_list Adjust the inequality to catch the case when NR_CPUS is configured to a small value. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Yury Norov <yury.norov@gmail.com> Cc: stable@vger.kernel.org Cc: feng xiangjun <fengxj325@gmail.com> Fixes: 7ee951acd31a ("drivers/base: fix userspace break from using bin_attributes for cpumap and cpulist") Reported-by: feng xiangjun <fengxj325@gmail.com> Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Yury Norov <yury.norov@gmail.com>
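The underlying pitfall is plain C integer conversion: when a negative signed value is compared against an unsigned long, the signed side is converted to a huge unsigned value, so the "is it bigger than PAGE_SIZE" check never clamps it. A standalone sketch of the behavior (not the kernel macro itself):

  #include <stdio.h>

  int main(void)
  {
  	unsigned long page_size = 4096;
  	long computed = -1;	/* a size expression that went negative for tiny NR_CPUS */

  	/* -1 is converted to ULONG_MAX for the comparison, so this branch is taken
  	 * and the "clamped" size ends up as 18446744073709551615.
  	 */
  	if (computed > page_size)
  		printf("max bytes = %lu\n", (unsigned long)computed);
  	return 0;
  }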
2022-09-09Merge tag 'asm-generic-fixes-6.0-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull SOFTIRQ_ON_OWN_STACK rework from Arnd Bergmann: "Just one fixup patch, reworking the softirq_on_own_stack logic for preempt-rt kernels as discussed in https://lore.kernel.org/all/CAHk-=wgZSD3W2y6yczad2Am=EfHYyiPzTn3CfXxrriJf9i5W5w@mail.gmail.com/" * tag 'asm-generic-fixes-6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: asm-generic: Conditionally enable do_softirq_own_stack() via Kconfig.
2022-09-09net: mscc: ocelot: share the common stat definitions between all driversVladimir Oltean
All switch families supported by the ocelot lib (ocelot, felix, seville) export the same registers so far, but felix, for example, also has TSN counters that the others don't. To reduce the bloat even further, create an OCELOT_COMMON_STATS() macro which just lists all stats that are common between switches. The array elements are still replicated among all of vsc9959_stats_layout, vsc9953_stats_layout and ocelot_stats_layout. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
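The rough shape of such a sharing macro, hedged and heavily abbreviated (the real macro lists every counter common to the three families):

  #define OCELOT_COMMON_STATS()						\
  	OCELOT_STAT_ETHTOOL(RX_OCTETS, "rx_octets"),			\
  	OCELOT_STAT_ETHTOOL(RX_UNICAST, "rx_unicast"),			\
  	/* ... every other counter shared by all families ... */	\
  	OCELOT_STAT_ETHTOOL(TX_OCTETS, "tx_octets")

  static const struct ocelot_stat_layout vsc9959_stats_layout[] = {
  	OCELOT_COMMON_STATS(),
  	/* felix-only TSN counters would be appended here */
  };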
2022-09-09net: mscc: ocelot: minimize definitions for statsVladimir Oltean
The current definition of struct ocelot_stat_layout is long-winded (4 lines per entry, and we have hundreds of entries), so we could make an effort to use the C preprocessor and reduce the line count. Create an implicit correspondence between enum ocelot_reg, which tells us the register address (SYS_COUNT_RX_OCTETS etc) and enum ocelot_stat which allows us to index the ocelot->stats array (OCELOT_STAT_RX_OCTETS etc), and don't require us to specify both when we define what stats each switch family has. Create an OCELOT_STAT() macro that pairs only an enum ocelot_stat to an enum ocelot_reg, and an OCELOT_STAT_ETHTOOL() macro which also contains a name exported to the unstructured ethtool -S stringset API. For now, we define all counters as having the OCELOT_STAT_ETHTOOL() kind, but we will add more counters in the future which are not exported to the unstructured ethtool -S. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
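A hedged sketch of the pairing described above (entry fields simplified; the driver's real macros carry more members): the macros rely on the enum ocelot_reg and enum ocelot_stat entries sharing the same suffix, so only one name has to be spelled out per counter.

  #define OCELOT_STAT(kind)						\
  	[OCELOT_STAT_ ## kind] = { .reg = SYS_COUNT_ ## kind }
  #define OCELOT_STAT_ETHTOOL(kind, ethtool_name)			\
  	[OCELOT_STAT_ ## kind] = { .reg = SYS_COUNT_ ## kind, .name = ethtool_name }

  /* e.g. OCELOT_STAT_ETHTOOL(RX_OCTETS, "rx_octets") expands to an entry that
   * indexes ocelot->stats by OCELOT_STAT_RX_OCTETS and reads SYS_COUNT_RX_OCTETS.
   */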
2022-09-09net: mscc: ocelot: harmonize names of SYS_COUNT_TX_AGING and OCELOT_STAT_TX_AGEDVladimir Oltean
The hardware counter is called C_TX_AGED, so rename SYS_COUNT_TX_AGING to SYS_COUNT_TX_AGED. This will become important since we want to minimize the way in which we declare struct ocelot_stat_layout elements, using the C preprocessor. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-09net: mscc: ocelot: add support for all sorts of standardized counters ↵Vladimir Oltean
present in DSA DSA is integrated with the new standardized ethtool -S --groups option, but the felix driver only exports unstructured statistics. Reuse the array of 64-bit statistics collected by ocelot_check_stats_work(), but just export select values from it. Since ocelot_check_stats_work() runs periodically to avoid 32-bit overflow, and the ethtool calling context is sleepable, we update the 64-bit stats one more time, to provide up-to-date values. The locking scheme with a mutex followed by a spinlock is a bit hard to digest, so we create and use an ocelot_port_stats_run() helper with a callback that populates the ethtool stats group the caller is interested in. The exported stats are: ethtool -S swp0 --groups eth-phy ethtool -S swp0 --groups eth-mac ethtool -S swp0 --groups eth-ctrl ethtool -S swp0 --groups rmon ethtool --include-statistics --show-pause swp0 Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
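A hedged sketch of the helper described above (lock member names and the refresh function are placeholders, not the driver's exact identifiers): the mutex serializes the sleepable hardware refresh, and the spinlock protects the callback's snapshot of the 64-bit counters.

  static void ocelot_port_stats_run(struct ocelot *ocelot, int port, void *priv,
  				  void (*cb)(struct ocelot *ocelot, int port, void *priv))
  {
  	mutex_lock(&ocelot->stat_view_lock);	/* placeholder member name */
  	ocelot_update_stats(ocelot);		/* sleepable refresh of the 64-bit counters */

  	spin_lock(&ocelot->stats_lock);		/* placeholder member name */
  	cb(ocelot, port, priv);			/* fill one ethtool stats group */
  	spin_unlock(&ocelot->stats_lock);

  	mutex_unlock(&ocelot->stat_view_lock);
  }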
2022-09-09net: dsa: felix: use ocelot's ndo_get_stats64 methodVladimir Oltean
Move the logic from the ocelot switchdev driver's ocelot_get_stats64() method to the common switch lib and reuse it for the DSA driver. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>