summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-05-13mptcp: fix subflow accounting on closePaolo Abeni
If the PM closes a fully established MPJ subflow or the subflow creation errors out in it's early stage the subflows counter is not bumped accordingly. This change adds the missing accounting, additionally taking care of updating accordingly the 'accept_subflow' flag. Fixes: a88c9e496937 ("mptcp: do not block subflows creation on errors") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-13eth: sfc: remove remnants of the out-of-tree napi_weight module paramJakub Kicinski
Remove napi_weight statics which are set to 64 and never modified, remnants of the out-of-tree napi_weight module param. Acked-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://lore.kernel.org/r/20220512205603.1536771-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-13proc/sysctl: make protected_* world readableJulius Hemanth Pitti
protected_* files have 600 permissions which prevents non-superuser from reading them. Container like "AWS greengrass" refuse to launch unless protected_hardlinks and protected_symlinks are set. When containers like these run with "userns-remap" or "--user" mapping container's root to non-superuser on host, they fail to run due to denied read access to these files. As these protections are hardly a secret, and do not possess any security risk, making them world readable. Though above greengrass usecase needs read access to only protected_hardlinks and protected_symlinks files, setting all other protected_* files to 644 to keep consistency. Link: http://lkml.kernel.org/r/20200709235115.56954-1-jpitti@cisco.com Fixes: 800179c9b8a1 ("fs: add link restrictions") Signed-off-by: Julius Hemanth Pitti <jpitti@cisco.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm, compaction: fast_find_migrateblock() should return pfn in the target zoneRei Yamamoto
At present, pages not in the target zone are added to cc->migratepages list in isolate_migratepages_block(). As a result, pages may migrate between nodes unintentionally. This would be a serious problem for older kernels without commit a984226f457f849e ("mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec"), because it can corrupt the lru list by handling pages in list without holding proper lru_lock. Avoid returning a pfn outside the target zone in the case that it is not aligned with a pageblock boundary. Otherwise isolate_migratepages_block() will handle pages not in the target zone. Link: https://lkml.kernel.org/r/20220511044300.4069-1-yamamoto.rei@jp.fujitsu.com Fixes: 70b44595eafe ("mm, compaction: use free lists to quickly locate a migration source") Signed-off-by: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Don Dutile <ddutile@redhat.com> Cc: Wonhyuk Yang <vvghjk1234@gmail.com> Cc: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm/damon: add documentation for Enum valueGautam Menghani
Fix the warning - "Enum value 'NR_DAMON_OPS' not described in enum 'damon_ops_id'" generated by the command "make pdfdocs" Link: https://lkml.kernel.org/r/20220508073316.141401-1-gautammenghani201@gmail.com Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm/memcontrol: export memcg->watermark via sysfs for v2 memcgGanesan Rajagopal
We run a lot of automated tests when building our software and run into OOM scenarios when the tests run unbounded. v1 memcg exports memcg->watermark as "memory.max_usage_in_bytes" in sysfs. We use this metric to heuristically limit the number of tests that can run in parallel based on per test historical data. This metric is currently not exported for v2 memcg and there is no other easy way of getting this information. getrusage() syscall returns "ru_maxrss" which can be used as an approximation but that's the max RSS of a single child process across all children instead of the aggregated max for all child processes. The only work around is to periodically poll "memory.current" but that's not practical for short-lived one-off cgroups. Hence, expose memcg->watermark as "memory.peak" for v2 memcg. Link: https://lkml.kernel.org/r/20220507050916.GA13577@us192.sjc.aristanetworks.com Signed-off-by: Ganesan Rajagopal <rganesan@arista.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctlMuchun Song
We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and reboot the server to enable or disable the feature of optimizing vmemmap pages associated with HugeTLB pages. However, rebooting usually takes a long time. So add a sysctl to enable or disable the feature at runtime without rebooting. Why we need this? There are 3 use cases. 1) The feature of minimizing overhead of struct page associated with each HugeTLB is disabled by default without passing "hugetlb_free_vmemmap=on" to the boot cmdline. When we (ByteDance) deliver the servers to the users who want to enable this feature, they have to configure the grub (change boot cmdline) and reboot the servers, whereas rebooting usually takes a long time (we have thousands of servers). It's a very bad experience for the users. So we need a approach to enable this feature after rebooting. This is a use case in our practical environment. 2) Some use cases are that HugeTLB pages are allocated 'on the fly' instead of being pulled from the HugeTLB pool, those workloads would be affected with this feature enabled. Those workloads could be identified by the characteristics of they never explicitly allocating huge pages with 'nr_hugepages' but only set 'nr_overcommit_hugepages' and then let the pages be allocated from the buddy allocator at fault time. We can confirm it is a real use case from the commit 099730d67417. For those workloads, the page fault time could be ~2x slower than before. We suspect those users want to disable this feature if the system has enabled this before and they don't think the memory savings benefit is enough to make up for the performance drop. 3) If the workload which wants vmemmap pages to be optimized and the workload which wants to set 'nr_overcommit_hugepages' and does not want the extera overhead at fault time when the overcommitted pages be allocated from the buddy allocator are deployed in the same server. The user could enable this feature and set 'nr_hugepages' and 'nr_overcommit_hugepages', then disable the feature. In this case, the overcommited HugeTLB pages will not encounter the extra overhead at fault time. Link: https://lkml.kernel.org/r/20220512041142.39501-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm: hugetlb_vmemmap: use kstrtobool for hugetlb_vmemmap param parsingMuchun Song
Use kstrtobool rather than open coding "on" and "off" parsing in mm/hugetlb_vmemmap.c, which is more powerful to handle all kinds of parameters like 'Yy1Nn0' or [oO][NnFf] for "on" and "off". Link: https://lkml.kernel.org/r/20220512041142.39501-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm: memory_hotplug: override memmap_on_memory when hugetlb_free_vmemmap=onMuchun Song
Optimizing HugeTLB vmemmap pages is not compatible with allocating memmap on hot added memory. If "hugetlb_free_vmemmap=on" and memory_hotplug.memmap_on_memory" are both passed on the kernel command line, optimizing hugetlb pages takes precedence. However, the global variable memmap_on_memory will still be set to 1, even though we will not try to allocate memmap on hot added memory. Also introduce mhp_memmap_on_memory() helper to move the definition of "memmap_on_memory" to the scope of CONFIG_MHP_MEMMAP_ON_MEMORY. In the next patch, mhp_memmap_on_memory() will also be exported to be used in hugetlb_vmemmap.c. Link: https://lkml.kernel.org/r/20220512041142.39501-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm: hugetlb_vmemmap: disable hugetlb_optimize_vmemmap when struct page ↵Muchun Song
crosses page boundaries Patch series "add hugetlb_optimize_vmemmap sysctl", v11. This series aims to add hugetlb_optimize_vmemmap sysctl to enable or disable the feature of optimizing vmemmap pages associated with HugeTLB pages. This patch (of 4): If the size of "struct page" is not the power of two but with the feature of minimizing overhead of struct page associated with each HugeTLB is enabled, then the vmemmap pages of HugeTLB will be corrupted after remapping (panic is about to happen in theory). But this only exists when !CONFIG_MEMCG && !CONFIG_SLUB on x86_64. However, it is not a conventional configuration nowadays. So it is not a real word issue, just the result of a code review. But we cannot prevent anyone from configuring that combined configure. This hugetlb_optimize_vmemmap should be disable in this case to fix this issue. Link: https://lkml.kernel.org/r/20220512041142.39501-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220512041142.39501-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm: rmap: fix CONT-PTE/PMD size hugetlb issue when unmappingBaolin Wang
On some architectures (like ARM64), it can support CONT-PTE/PMD size hugetlb, which means it can support not only PMD/PUD size hugetlb: 2M and 1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page size specified. When unmapping a hugetlb page, we will get the relevant page table entry by huge_pte_offset() only once to nuke it. This is correct for PMD or PUD size hugetlb, since they always contain only one pmd entry or pud entry in the page table. However this is incorrect for CONT-PTE and CONT-PMD size hugetlb, since they can contain several continuous pte or pmd entry with same page table attributes, so we will nuke only one pte or pmd entry for this CONT-PTE/PMD size hugetlb page. And now try_to_unmap() is only passed a hugetlb page in the case where the hugetlb page is poisoned. Which means now we will unmap only one pte entry for a CONT-PTE or CONT-PMD size poisoned hugetlb page, and we can still access other subpages of a CONT-PTE or CONT-PMD size poisoned hugetlb page, which will cause serious issues possibly. So we should change to use huge_ptep_clear_flush() to nuke the hugetlb page table to fix this issue, which already considered CONT-PTE and CONT-PMD size hugetlb. We've already used set_huge_swap_pte_at() to set a poisoned swap entry for a poisoned hugetlb page. Meanwhile adding a VM_BUG_ON() to make sure the passed hugetlb page is poisoned in try_to_unmap(). Link: https://lkml.kernel.org/r/0a2e547238cad5bc153a85c3e9658cb9d55f9cac.1652270205.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/730ea4b6d292f32fb10b7a4e87dad49b0eb30474.1652147571.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Rich Felker <dalias@libc.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm: rmap: fix CONT-PTE/PMD size hugetlb issue when migrationBaolin Wang
On some architectures (like ARM64), it can support CONT-PTE/PMD size hugetlb, which means it can support not only PMD/PUD size hugetlb: 2M and 1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page size specified. When migrating a hugetlb page, we will get the relevant page table entry by huge_pte_offset() only once to nuke it and remap it with a migration pte entry. This is correct for PMD or PUD size hugetlb, since they always contain only one pmd entry or pud entry in the page table. However this is incorrect for CONT-PTE and CONT-PMD size hugetlb, since they can contain several continuous pte or pmd entry with same page table attributes. So we will nuke or remap only one pte or pmd entry for this CONT-PTE/PMD size hugetlb page, which is not expected for hugetlb migration. The problem is we can still continue to modify the subpages' data of a hugetlb page during migrating a hugetlb page, which can cause a serious data consistent issue, since we did not nuke the page table entry and set a migration pte for the subpages of a hugetlb page. To fix this issue, we should change to use huge_ptep_clear_flush() to nuke a hugetlb page table, and remap it with set_huge_pte_at() and set_huge_swap_pte_at() when migrating a hugetlb page, which already considered the CONT-PTE or CONT-PMD size hugetlb. [akpm@linux-foundation.org: fix nommu build] [baolin.wang@linux.alibaba.com: fix build errors for !CONFIG_MMU] Link: https://lkml.kernel.org/r/a4baca670aca637e7198d9ae4543b8873cb224dc.1652270205.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/ea5abf529f0997b5430961012bfda6166c1efc8c.1652147571.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Rich Felker <dalias@libc.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm: change huge_ptep_clear_flush() to return the original pteBaolin Wang
Patch series "Fix CONT-PTE/PMD size hugetlb issue when unmapping or migrating", v4. presently, migrating a hugetlb page or unmapping a poisoned hugetlb page, we'll use ptep_clear_flush() and set_pte_at() to nuke the page table entry and remap it, and this is incorrect for CONT-PTE or CONT-PMD size hugetlb page, which will cause potential data consistent issue. This patch set will change to use hugetlb related APIs to fix this issue. Note: Mike pointed out the huge_ptep_get() will only return the one specific value, and it would not take into account the dirty or young bits of CONT-PTE/PMDs like the huge_ptep_get_and_clear() [1]. This inconsistent issue is not introduced by this patch set, and this issue will be addressed in another thread [2]. Meanwhile the uffd for hugetlb case [3] pointed out by Gerald also needs another patch to address. [1] https://lore.kernel.org/linux-mm/85bd80b4-b4fd-0d3f-a2e5-149559f2f387@oracle.com/ [2] https://lore.kernel.org/all/cover.1651998586.git.baolin.wang@linux.alibaba.com/ [3] https://lore.kernel.org/linux-mm/20220503120343.6264e126@thinkpad/ This patch (of 3): It is incorrect to use ptep_clear_flush() to nuke a hugetlb page table when unmapping or migrating a hugetlb page, and will change to use huge_ptep_clear_flush() instead in the following patches. So this is a preparation patch, which changes the huge_ptep_clear_flush() to return the original pte to help to nuke a hugetlb page table. [baolin.wang@linux.alibaba.com: fix build in several more architectures] Link: https://lkml.kernel.org/r/0009a4cd-2826-e8be-e671-f050d4f18d5d@linux.alibaba.com [sfr@canb.auug.org.au: fixup] Link: https://lkml.kernel.org/r/20220511181531.7f27a5c1@canb.auug.org.au Link: https://lkml.kernel.org/r/cover.1652270205.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/20f77ddab90baa249bd24504c413189b82acde69.1652270205.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/cover.1652147571.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/dcf065868cce35bceaf138613ad27f17bb7c0c19.1652147571.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Yoshinori Sato <ysato@users.osdn.me> Cc: Rich Felker <dalias@libc.org> Cc: David S. Miller <davem@davemloft.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13Documentation/vm: rework "Temporary Virtual Mappings" sectionFabio M. De Francesco
Extend and rework the "Temporary Virtual Mappings" section of the highmem.rst documentation. Despite the local kmaps were introduced by Thomas Gleixner in October 2020, documentation was still missing information about them. These additions rely largely on Gleixner's patches, Jonathan Corbet's LWN articles, comments by Ira Weiny and Matthew Wilcox, and in-code comments from ./include/linux/highmem.h. 1) Add a paragraph to document kmap_local_page(). 2) Reorder the list of functions by decreasing order of preference of use. 3) Rework part of the kmap() entry in list. Link: https://lkml.kernel.org/r/20220428212455.892-5-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13Documentation/vm: move "Using kmap-atomic" to highmem.hFabio M. De Francesco
The use of kmap_atomic() is new code is being deprecated in favor of kmap_local_page(). For this reason the "Using kmap_atomic" section in highmem.rst is obsolete and unnecessary, but it can still help developers if it were moved to kdocs in highmem.h. Therefore, move the relevant parts of this section from highmem.rst and merge them with the kdocs in highmem.h. Link: https://lkml.kernel.org/r/20220428212455.892-4-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13Documentation/vm: include kdocs from highmem*.h into highmem.rstFabio M. De Francesco
kernel-docs that are in include/linux/highmem.h and in include/linux/highmem-internal.h should be included in highmem.rst. Use kdocs directives to include the above-mentioned comments into highmem.rst. Link: https://lkml.kernel.org/r/20220428212455.892-3-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm/highmem: fix kernel-doc warnings in highmem*.hFabio M. De Francesco
Patch series "Extend and reorganize Highmem's documentation", v4. This series has the purpose to extend and reorganize Highmem's documentation. This is a work in progress because some information should still be moved from highmem.rst to highmem.h and highmem-internal.h. Specifically I'm talking about moving the "how to" information to the relevant headers, as it as been suggested by Ira Weiny (Intel). Also, this is a work in progress because some kdocs in highmem.h and highmem-internal.h should be improved. This patch (of 4): `scripts/kernel-doc -v -none include/linux/highmem*` reports the following warnings: include/linux/highmem.h:160: warning: expecting prototype for kunmap_atomic(). Prototype was for nr_free_highpages() instead include/linux/highmem.h:204: warning: No description found for return value of 'alloc_zeroed_user_highpage_movable' include/linux/highmem-internal.h:256: warning: Function parameter or member '__addr' not described in 'kunmap_atomic' include/linux/highmem-internal.h:256: warning: Excess function parameter 'addr' description in 'kunmap_atomic' Fix these warnings by (1) moving the kernel-doc comments from highmem.h to highmem-internal.h (which is the file were the kunmap_atomic() macro is actually defined), (2) extending and merging it with the comment which was already in highmem-internal.h, and (3) using correct parameter names (4) correcting a few technical inaccuracies in comments, and (5) adding a deprecation notice in kunmap_atomic() for consistency with kmap_atomic(). Link: https://lkml.kernel.org/r/20220428212455.892-1-fmdefrancesco@gmail.com Link: https://lkml.kernel.org/r/20220428212455.892-2-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Peter Collingbourne <pcc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13Merge tag 'drm-fixes-2022-05-14' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull more drm fixes from Dave Airlie: "Turns out I was right, some fixes hadn't made it to me yet. The vmwgfx ones also popped up later, but all seem like bad enough things to fix. The dma-buf, vc4 and nouveau ones are all pretty small. The fbdev fixes are a bit more complicated: a fix to cleanup fbdev devices properly, uncovered some use-after-free bugs in existing drivers. Then the fix for those bugs wasn't correct. This reverts that fix, and puts the proper fixes in place in the drivers to avoid the use-after-frees. This has had a fair number of eyes on it at this stage, and I'm confident enough that it puts things in the right place, and is less dangerous than reverting our way out of the initial change at this stage. fbdev: - revert NULL deref fix that turned into a use-after-free - prevent use-after-free in fbdev - efifb/simplefb/vesafb: fix cleanup paths to avoid use-after-frees dma-buf: - fix panic in stats setup vc4: - fix hdmi build nouveau: - tegra iommu present fix - fix leak in backlight name vmwgfx: - Black screen due to fences using FIFO checks on SVGA3 - Random black screens on boot due to uninitialized drm_mode_fb_cmd2 - Hangs on SVGA3 due to command buffers being used with gbobjects" * tag 'drm-fixes-2022-05-14' of git://anongit.freedesktop.org/drm/drm: drm/vmwgfx: Disable command buffers on svga3 without gbobjects drm/vmwgfx: Initialize drm_mode_fb_cmd2 drm/vmwgfx: Fix fencing on SVGAv3 drm/vc4: hdmi: Fix build error for implicit function declaration dma-buf: call dma_buf_stats_setup after dmabuf is in valid list fbdev: efifb: Fix a use-after-free due early fb_info cleanup drm/nouveau: Fix a potential theorical leak in nouveau_get_backlight_name() drm/nouveau/tegra: Stop using iommu_present() fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove fbdev: efifb: Cleanup fb_info in .fb_destroy rather than .remove fbdev: simplefb: Cleanup fb_info in .fb_destroy rather than .remove fbdev: Prevent possible use-after-free in fb_release() Revert "fbdev: Make fb_release() return -ENODEV if fbdev was unregistered"
2022-05-14pinctrl: stm32: Unshadow np variable in stm32_pctl_probe()Andy Shevchenko
The np variable is used globally for stm32_pctl_probe() and in one of its code branches. cppcheck is not happy with that: pinctrl-stm32.c:1530:23: warning: Local variable 'np' shadows outer variable [shadowVariable] Instead of simply renaming one of the variables convert some code to use a device pointer directly. Fixes: bb949ed9b16b ("pinctrl: stm32: Switch to use for_each_gpiochip_node() helper") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Fabien Dessenne <fabien.dessenne@foss.st.com> Link: https://lore.kernel.org/r/20220507102257.26414-1-andriy.shevchenko@linux.intel.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2022-05-13bpftool: Use sysfs vmlinux when dumping BTF by IDLarysa Zaremba
Currently, dumping almost all BTFs specified by id requires using the -B option to pass the base BTF. For kernel module BTFs the vmlinux BTF sysfs path should work. This patch simplifies dumping by ID usage by loading vmlinux BTF from sysfs as base, if base BTF was not specified and the ID corresponds to a kernel module BTF. Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20220513121743.12411-1-larysa.zaremba@intel.com
2022-05-14pinctrl: sunxi: f1c100s: Fix signal name comment for PA2 SPI pinAndre Przywara
The manual describes function 0x6 of pin PA2 as "SPI1_CLK", so change the comment to reflect that. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com> Link: https://lore.kernel.org/r/20220504170736.2669595-1-andre.przywara@arm.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2022-05-14pinctrl: sunxi: fix f1c100s uart2 functionIotaHydrae
Change suniv f1c100s pinctrl,PD14 multiplexing function lvds1 to uart2 When the pin PD13 and PD14 is setting up to uart2 function in dts, there's an error occurred: 1c20800.pinctrl: unsupported function uart2 on pin PD14 Because 'uart2' is not any one multiplexing option of PD14, and pinctrl don't know how to configure it. So change the pin PD14 lvds1 function to uart2. Signed-off-by: IotaHydrae <writeforever@foxmail.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Link: https://lore.kernel.org/r/tencent_70C1308DDA794C81CAEF389049055BACEC09@qq.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2022-05-13block/mq-deadline: Set the fifo_time member also if inserting at headBart Van Assche
Before commit 322cff70d46c the fifo_time member of requests on a dispatch list was not used. Commit 322cff70d46c introduces code that reads the fifo_time member of requests on dispatch lists. Hence this patch that sets the fifo_time member when adding a request to a dispatch list. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Fixes: 322cff70d46c ("block/mq-deadline: Prioritize high-priority requests") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220513171307.32564-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-14Merge tag 'renesas-pinctrl-for-v5.19-tag2' of ↵Linus Walleij
git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers into devel pinctrl: renesas: Updates for v5.19 (take two) - Reserved field optimizations, - Miscellaneous fixes and improvements.
2022-05-13bpf: Add MEM_UNINIT as a bpf_type_flagJoanne Koong
Instead of having uninitialized versions of arguments as separate bpf_arg_types (eg ARG_PTR_TO_UNINIT_MEM as the uninitialized version of ARG_PTR_TO_MEM), we can instead use MEM_UNINIT as a bpf_type_flag modifier to denote that the argument is uninitialized. Doing so cleans up some of the logic in the verifier. We no longer need to do two checks against an argument type (eg "if (base_type(arg_type) == ARG_PTR_TO_MEM || base_type(arg_type) == ARG_PTR_TO_UNINIT_MEM)"), since uninitialized and initialized versions of the same argument type will now share the same base type. In the near future, MEM_UNINIT will be used by dynptr helper functions as well. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20220509224257.3222614-2-joannelkoong@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-05-14Merge tag 'drm-misc-fixes-2022-05-13' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes Multiple fixes to fbdev to address a regression at unregistration, an iommu detection improvement for nouveau, a memory leak fix for nouveau, pointer dereference fix for dma_buf_file_release(), and a build breakage fix for vc4 Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <maxime@cerno.tech> Link: https://patchwork.freedesktop.org/patch/msgid/20220513073044.ymayac7x7bzatrt7@houat
2022-05-14Merge tag 'vmwgfx-drm-fixes-5.18-2022-05-13' of ↵Dave Airlie
https://gitlab.freedesktop.org/zack/vmwgfx into drm-fixes vmwgfx fixes for: - Black screen due to fences using FIFO checks on SVGA3 - Random black screens on boot due to uninitialized drm_mode_fb_cmd2 - Hangs on SVGA3 due to command buffers being used with gbobjects Signed-off-by: Dave Airlie <airlied@redhat.com> From: Zack Rusin <zackr@vmware.com> Link: https://patchwork.freedesktop.org/patch/msgid/a1d32799e4c74b8540216376d7576bb783ca07ba.camel@vmware.com
2022-05-13zsmalloc: fix races between asynchronous zspage free and page migrationSultan Alsawaf
The asynchronous zspage free worker tries to lock a zspage's entire page list without defending against page migration. Since pages which haven't yet been locked can concurrently migrate off the zspage page list while lock_zspage() churns away, lock_zspage() can suffer from a few different lethal races. It can lock a page which no longer belongs to the zspage and unsafely dereference page_private(), it can unsafely dereference a torn pointer to the next page (since there's a data race), and it can observe a spurious NULL pointer to the next page and thus not lock all of the zspage's pages (since a single page migration will reconstruct the entire page list, and create_page_chain() unconditionally zeroes out each list pointer in the process). Fix the races by using migrate_read_lock() in lock_zspage() to synchronize with page migration. Link: https://lkml.kernel.org/r/20220509024703.243847-1-sultan@kerneltoast.com Fixes: 77ff465799c602 ("zsmalloc: zs_page_migrate: skip unnecessary loops but not return -EBUSY if zspage is not inuse") Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13Revert "mm/cma.c: remove redundant cma_mutex lock"Dong Aisheng
This reverts commit a4efc174b382fcdb which introduced a regression issue that when there're multiple processes allocating dma memory in parallel by calling dma_alloc_coherent(), it may fail sometimes as follows: Error log: cma: cma_alloc: linux,cma: alloc failed, req-size: 148 pages, ret: -16 cma: number of available pages: 3@125+20@172+12@236+4@380+32@736+17@2287+23@2473+20@36076+99@40477+108@40852+44@41108+20@41196+108@41364+108@41620+ 108@42900+108@43156+483@44061+1763@45341+1440@47712+20@49324+20@49388+5076@49452+2304@55040+35@58141+20@58220+20@58284+ 7188@58348+84@66220+7276@66452+227@74525+6371@75549=> 33161 free of 81920 total pages When issue happened, we saw there were still 33161 pages (129M) free CMA memory and a lot available free slots for 148 pages in CMA bitmap that we want to allocate. When dumping memory info, we found that there was also ~342M normal memory, but only 1352K CMA memory left in buddy system while a lot of pageblocks were isolated. Memory info log: Normal free:351096kB min:30000kB low:37500kB high:45000kB reserved_highatomic:0KB active_anon:98060kB inactive_anon:98948kB active_file:60864kB inactive_file:31776kB unevictable:0kB writepending:0kB present:1048576kB managed:1018328kB mlocked:0kB bounce:0kB free_pcp:220kB local_pcp:192kB free_cma:1352kB lowmem_reserve[]: 0 0 0 Normal: 78*4kB (UECI) 1772*8kB (UMECI) 1335*16kB (UMECI) 360*32kB (UMECI) 65*64kB (UMCI) 36*128kB (UMECI) 16*256kB (UMCI) 6*512kB (EI) 8*1024kB (UEI) 4*2048kB (MI) 8*4096kB (EI) 8*8192kB (UI) 3*16384kB (EI) 8*32768kB (M) = 489288kB The root cause of this issue is that since commit a4efc174b382 ("mm/cma.c: remove redundant cma_mutex lock"), CMA supports concurrent memory allocation. It's possible that the memory range process A trying to alloc has already been isolated by the allocation of process B during memory migration. The problem here is that the memory range isolated during one allocation by start_isolate_page_range() could be much bigger than the real size we want to alloc due to the range is aligned to MAX_ORDER_NR_PAGES. Taking an ARMv7 platform with 1G memory as an example, when MAX_ORDER_NR_PAGES is big (e.g. 32M with max_order 14) and CMA memory is relatively small (e.g. 128M), there're only 4 MAX_ORDER slot, then it's very easy that all CMA memory may have already been isolated by other processes when one trying to allocate memory using dma_alloc_coherent(). Since current CMA code will only scan one time of whole available CMA memory, then dma_alloc_coherent() may easy fail due to contention with other processes. This patch simply falls back to the original method that using cma_mutex to make alloc_contig_range() run sequentially to avoid the issue. Link: https://lkml.kernel.org/r/20220509094551.3596244-1-aisheng.dong@nxp.com Link: https://lore.kernel.org/all/20220315144521.3810298-2-aisheng.dong@nxp.com/ Fixes: a4efc174b382 ("mm/cma.c: remove redundant cma_mutex lock") Signed-off-by: Dong Aisheng <aisheng.dong@nxp.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [5.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13random: use first 128 bits of input as fast initJason A. Donenfeld
Before, the first 64 bytes of input, regardless of how entropic it was, would be used to mutate the crng base key directly, and none of those bytes would be credited as having entropy. Then 256 bits of credited input would be accumulated, and only then would the rng transition from the earlier "fast init" phase into being actually initialized. The thinking was that by mixing and matching fast init and real init, an attacker who compromised the fast init state, considered easy to do given how little entropy might be in those first 64 bytes, would then be able to bruteforce bits from the actual initialization. By keeping these separate, bruteforcing became impossible. However, by not crediting potentially creditable bits from those first 64 bytes of input, we delay initialization, and actually make the problem worse, because it means the user is drawing worse random numbers for a longer period of time. Instead, we can take the first 128 bits as fast init, and allow them to be credited, and then hold off on the next 128 bits until they've accumulated. This is still a wide enough margin to prevent bruteforcing the rng state, while still initializing much faster. Then, rather than trying to piecemeal inject into the base crng key at various points, instead just extract from the pool when we need it, for the crng_init==0 phase. Performance may even be better for the various inputs here, since there are likely more calls to mix_pool_bytes() then there are to get_random_bytes() during this phase of system execution. Since the preinit injection code is gone, bootloader randomness can then do something significantly more straight forward, removing the weird system_wq hack in hwgenerator randomness. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13random: do not use batches when !crng_ready()Jason A. Donenfeld
It's too hard to keep the batches synchronized, and pointless anyway, since in !crng_ready(), we're updating the base_crng key really often, where batching only hurts. So instead, if the crng isn't ready, just call into get_random_bytes(). At this stage nothing is performance critical anyhow. Cc: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13random: mix in timestamps and reseed on system restoreJason A. Donenfeld
Since the RNG loses freshness with system suspend/hibernation, when we resume, immediately reseed using whatever data we can, which for this particular case is the various timestamps regarding system suspend time, in addition to more generally the RDSEED/RDRAND/RDTSC values that happen whenever the crng reseeds. On systems that suspend and resume automatically all the time -- such as Android -- we skip the reseeding on suspend resumption, since that could wind up being far too busy. This is the same trade-off made in WireGuard. In addition to reseeding upon resumption always mix into the pool these various stamps on every power notification event. Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13random: vary jitter iterations based on cycle counter speedJason A. Donenfeld
Currently, we do the jitter dance if two consecutive reads to the cycle counter return different values. If they do, then we consider the cycle counter to be fast enough that one trip through the scheduler will yield one "bit" of credited entropy. If those two reads return the same value, then we assume the cycle counter is too slow to show meaningful differences. This methodology is flawed for a variety of reasons, one of which Eric posted a patch to fix in [1]. The issue that patch solves is that on a system with a slow counter, you might be [un]lucky and read the counter _just_ before it changes, so that the second cycle counter you read differs from the first, even though there's usually quite a large period of time in between the two. For example: | real time | cycle counter | | --------- | ------------- | | 3 | 5 | | 4 | 5 | | 5 | 5 | | 6 | 5 | | 7 | 5 | <--- a | 8 | 6 | <--- b | 9 | 6 | <--- c If we read the counter at (a) and compare it to (b), we might be fooled into thinking that it's a fast counter, when in reality it is not. The solution in [1] is to also compare counter (b) to counter (c), on the theory that if the counter is _actually_ slow, and (a)!=(b), then certainly (b)==(c). This helps solve this particular issue, in one sense, but in another sense, it mostly functions to disallow jitter entropy on these systems, rather than simply taking more samples in that case. Instead, this patch takes a different approach. Right now we assume that a difference in one set of consecutive samples means one "bit" of credited entropy per scheduler trip. We can extend this so that a difference in two sets of consecutive samples means one "bit" of credited entropy per /two/ scheduler trips, and three for three, and four for four. In other words, we can increase the amount of jitter "work" we require for each "bit", depending on how slow the cycle counter is. So this patch takes whole bunch of samples, sees how many of them are different, and divides to find the amount of work required per "bit", and also requires that at least some minimum of them are different in order to attempt any jitter entropy. Note that this approach is still far from perfect. It's not a real statistical estimate on how much these samples vary; it's not a real-time analysis of the relevant input data. That remains a project for another time. However, it makes the same (partly flawed) assumptions as the code that's there now, so it's probably not worse than the status quo, and it handles the issue Eric mentioned in [1]. But, again, it's probably a far cry from whatever a really robust version of this would be. [1] https://lore.kernel.org/lkml/20220421233152.58522-1-ebiggers@kernel.org/ https://lore.kernel.org/lkml/20220421192939.250680-1-ebiggers@kernel.org/ Cc: Eric Biggers <ebiggers@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13random: insist on random_get_entropy() existing in order to simplifyJason A. Donenfeld
All platforms are now guaranteed to provide some value for random_get_entropy(). In case some bug leads to this not being so, we print a warning, because that indicates that something is really very wrong (and likely other things are impacted too). This should never be hit, but it's a good and cheap way of finding out if something ever is problematic. Since we now have viable fallback code for random_get_entropy() on all platforms, which is, in the worst case, not worse than jiffies, we can count on getting the best possible value out of it. That means there's no longer a use for using jiffies as entropy input. It also means we no longer have a reason for doing the round-robin register flow in the IRQ handler, which was always of fairly dubious value. Instead we can greatly simplify the IRQ handler inputs and also unify the construction between 64-bits and 32-bits. We now collect the cycle counter and the return address, since those are the two things that matter. Because the return address and the irq number are likely related, to the extent we mix in the irq number, we can just xor it into the top unchanging bytes of the return address, rather than the bottom changing bytes of the cycle counter as before. Then, we can do a fixed 2 rounds of SipHash/HSipHash. Finally, we use the same construction of hashing only half of the [H]SipHash state on 32-bit and 64-bit. We're not actually discarding any entropy, since that entropy is carried through until the next time. And more importantly, it lets us do the same sponge-like construction everywhere. Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13xtensa: use fallback for random_get_entropy() instead of zeroJason A. Donenfeld
In the event that random_get_entropy() can't access a cycle counter or similar, falling back to returning 0 is really not the best we can do. Instead, at least calling random_get_entropy_fallback() would be preferable, because that always needs to return _something_, even falling back to jiffies eventually. It's not as though random_get_entropy_fallback() is super high precision or guaranteed to be entropic, but basically anything that's not zero all the time is better than returning zero all the time. This is accomplished by just including the asm-generic code like on other architectures, which means we can get rid of the empty stub function here. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13sparc: use fallback for random_get_entropy() instead of zeroJason A. Donenfeld
In the event that random_get_entropy() can't access a cycle counter or similar, falling back to returning 0 is really not the best we can do. Instead, at least calling random_get_entropy_fallback() would be preferable, because that always needs to return _something_, even falling back to jiffies eventually. It's not as though random_get_entropy_fallback() is super high precision or guaranteed to be entropic, but basically anything that's not zero all the time is better than returning zero all the time. This is accomplished by just including the asm-generic code like on other architectures, which means we can get rid of the empty stub function here. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13um: use fallback for random_get_entropy() instead of zeroJason A. Donenfeld
In the event that random_get_entropy() can't access a cycle counter or similar, falling back to returning 0 is really not the best we can do. Instead, at least calling random_get_entropy_fallback() would be preferable, because that always needs to return _something_, even falling back to jiffies eventually. It's not as though random_get_entropy_fallback() is super high precision or guaranteed to be entropic, but basically anything that's not zero all the time is better than returning zero all the time. This is accomplished by just including the asm-generic code like on other architectures, which means we can get rid of the empty stub function here. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13x86/tsc: Use fallback for random_get_entropy() instead of zeroJason A. Donenfeld
In the event that random_get_entropy() can't access a cycle counter or similar, falling back to returning 0 is suboptimal. Instead, fallback to calling random_get_entropy_fallback(), which isn't extremely high precision or guaranteed to be entropic, but is certainly better than returning zero all the time. If CONFIG_X86_TSC=n, then it's possible for the kernel to run on systems without RDTSC, such as 486 and certain 586, so the fallback code is only required for that case. As well, fix up both the new function and the get_cycles() function from which it was derived to use cpu_feature_enabled() rather than boot_cpu_has(), and use !IS_ENABLED() instead of #ifndef. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: x86@kernel.org
2022-05-13nios2: use fallback for random_get_entropy() instead of zeroJason A. Donenfeld
In the event that random_get_entropy() can't access a cycle counter or similar, falling back to returning 0 is really not the best we can do. Instead, at least calling random_get_entropy_fallback() would be preferable, because that always needs to return _something_, even falling back to jiffies eventually. It's not as though random_get_entropy_fallback() is super high precision or guaranteed to be entropic, but basically anything that's not zero all the time is better than returning zero all the time. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Acked-by: Dinh Nguyen <dinguyen@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13arm: use fallback for random_get_entropy() instead of zeroJason A. Donenfeld
In the event that random_get_entropy() can't access a cycle counter or similar, falling back to returning 0 is really not the best we can do. Instead, at least calling random_get_entropy_fallback() would be preferable, because that always needs to return _something_, even falling back to jiffies eventually. It's not as though random_get_entropy_fallback() is super high precision or guaranteed to be entropic, but basically anything that's not zero all the time is better than returning zero all the time. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13mips: use fallback for random_get_entropy() instead of just c0 randomJason A. Donenfeld
For situations in which we don't have a c0 counter register available, we've been falling back to reading the c0 "random" register, which is usually bounded by the amount of TLB entries and changes every other cycle or so. This means it wraps extremely often. We can do better by combining this fast-changing counter with a potentially slower-changing counter from random_get_entropy_fallback() in the more significant bits. This commit combines the two, taking into account that the changing bits are in a different bit position depending on the CPU model. In addition, we previously were falling back to 0 for ancient CPUs that Linux does not support anyway; remove that dead path entirely. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Tested-by: Maciej W. Rozycki <macro@orcam.me.uk> Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13riscv: use fallback for random_get_entropy() instead of zeroJason A. Donenfeld
In the event that random_get_entropy() can't access a cycle counter or similar, falling back to returning 0 is really not the best we can do. Instead, at least calling random_get_entropy_fallback() would be preferable, because that always needs to return _something_, even falling back to jiffies eventually. It's not as though random_get_entropy_fallback() is super high precision or guaranteed to be entropic, but basically anything that's not zero all the time is better than returning zero all the time. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul Walmsley <paul.walmsley@sifive.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13m68k: use fallback for random_get_entropy() instead of zeroJason A. Donenfeld
In the event that random_get_entropy() can't access a cycle counter or similar, falling back to returning 0 is really not the best we can do. Instead, at least calling random_get_entropy_fallback() would be preferable, because that always needs to return _something_, even falling back to jiffies eventually. It's not as though random_get_entropy_fallback() is super high precision or guaranteed to be entropic, but basically anything that's not zero all the time is better than returning zero all the time. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13timekeeping: Add raw clock fallback for random_get_entropy()Jason A. Donenfeld
The addition of random_get_entropy_fallback() provides access to whichever time source has the highest frequency, which is useful for gathering entropy on platforms without available cycle counters. It's not necessarily as good as being able to quickly access a cycle counter that the CPU has, but it's still something, even when it falls back to being jiffies-based. In the event that a given arch does not define get_cycles(), falling back to the get_cycles() default implementation that returns 0 is really not the best we can do. Instead, at least calling random_get_entropy_fallback() would be preferable, because that always needs to return _something_, even falling back to jiffies eventually. It's not as though random_get_entropy_fallback() is super high precision or guaranteed to be entropic, but basically anything that's not zero all the time is better than returning zero all the time. Finally, since random_get_entropy_fallback() is used during extremely early boot when randomizing freelists in mm_init(), it can be called before timekeeping has been initialized. In that case there really is nothing we can do; jiffies hasn't even started ticking yet. So just give up and return 0. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Theodore Ts'o <tytso@mit.edu>
2022-05-13openrisc: start CPU timer early in bootJason A. Donenfeld
In order to measure the boot process, the timer should be switched on as early in boot as possible. As well, the commit defines the get_cycles macro, like the previous patches in this series, so that generic code is aware that it's implemented by the platform, as is done on other archs. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Acked-by: Stafford Horne <shorne@gmail.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13powerpc: define get_cycles macro for arch-overrideJason A. Donenfeld
PowerPC defines a get_cycles() function, but it does not do the usual `#define get_cycles get_cycles` dance, making it impossible for generic code to see if an arch-specific function was defined. While the get_cycles() ifdef is not currently used, the following timekeeping patch in this series will depend on the macro existing (or not existing) when defining random_get_entropy(). Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@ozlabs.org> Cc: Paul Mackerras <paulus@samba.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13alpha: define get_cycles macro for arch-overrideJason A. Donenfeld
Alpha defines a get_cycles() function, but it does not do the usual `#define get_cycles get_cycles` dance, making it impossible for generic code to see if an arch-specific function was defined. While the get_cycles() ifdef is not currently used, the following timekeeping patch in this series will depend on the macro existing (or not existing) when defining random_get_entropy(). Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Acked-by: Matt Turner <mattst88@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13parisc: define get_cycles macro for arch-overrideJason A. Donenfeld
PA-RISC defines a get_cycles() function, but it does not do the usual `#define get_cycles get_cycles` dance, making it impossible for generic code to see if an arch-specific function was defined. While the get_cycles() ifdef is not currently used, the following timekeeping patch in this series will depend on the macro existing (or not existing) when defining random_get_entropy(). Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Acked-by: Helge Deller <deller@gmx.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13s390: define get_cycles macro for arch-overrideJason A. Donenfeld
S390x defines a get_cycles() function, but it does not do the usual `#define get_cycles get_cycles` dance, making it impossible for generic code to see if an arch-specific function was defined. While the get_cycles() ifdef is not currently used, the following timekeeping patch in this series will depend on the macro existing (or not existing) when defining random_get_entropy(). Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13ia64: define get_cycles macro for arch-overrideJason A. Donenfeld
Itanium defines a get_cycles() function, but it does not do the usual `#define get_cycles get_cycles` dance, making it impossible for generic code to see if an arch-specific function was defined. While the get_cycles() ifdef is not currently used, the following timekeeping patch in this series will depend on the macro existing (or not existing) when defining random_get_entropy(). Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>