Age | Commit message | Author
2022-05-19 | mm: don't be stuck to rmap lock on reclaim path | Minchan Kim
The rmap locks (i_mmap_rwsem and anon_vma->root->rwsem) can be contended under memory pressure if processes keep working on their vmas (e.g., fork, mmap, munmap). That makes the reclaim path get stuck. In our real workload traces, we see kswapd waiting on the lock for 300ms+ (worst case, a second), which pushes other processes into direct reclaim, where they also get stuck on the lock. This patch makes the lru aging path use try_lock mode, like shrink_page_list, so the reclaim context keeps working on the next lru pages without getting stuck. If it finds the rmap lock contended, it rotates the page back to the head of the lru, in both the active and inactive lrus, to keep their behavior consistent; this is a basic starting point rather than adding more heuristics. Since this patch introduces a new "contended" field as an out-param along with the try_lock in-param in rmap_walk_control, the structure is no longer immutable when try_lock is set, so remove the const keywords on the rmap-related functions. Since rmap walking is already an expensive operation, I doubt the const would bring a sizable benefit (and we didn't have it until 5.17). In a heavy app workload on Android, the trace shows the following statistics. It almost removes rmap lock contention from the reclaim path.
Martin Liu reported:

Before:

   max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count  blocked_function
          1632            0            1631   151.542173        31672    209  page_lock_anon_vma_read
           601            0             601   145.544681        28817    198  rmap_walk_file

After:

   max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count  blocked_function
           NaN          NaN             NaN          NaN          NaN    0.0  NaN
             0            0               0     0.127645            1     12  rmap_walk_file

[minchan@kernel.org: add comment, per Matthew] Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: John Dias <joaodias@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Martin Liu <liumartin@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
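The try_lock/contended idea in the commit above can be illustrated with a small user-space sketch. This is not the kernel code: the sketch_* names are hypothetical, and a C11 atomic_flag stands in for the rmap rwsems (so it is an exclusive lock, not a reader/writer one). It only shows the shape of the pattern: the walker never sleeps on the lock, reports contention through an out-field, and the caller moves on to the next item.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an rmap lock; a simple test-and-set flag. */
struct sketch_lock {
    atomic_flag held;
};

void sketch_lock_init(struct sketch_lock *l) { atomic_flag_clear(&l->held); }

/* Returns true if the lock was taken, false if somebody else holds it. */
bool sketch_trylock(struct sketch_lock *l)
{
    return !atomic_flag_test_and_set(&l->held);
}

/* Blocking variant: spins until the lock is free. */
void sketch_lock_acquire(struct sketch_lock *l)
{
    while (atomic_flag_test_and_set(&l->held))
        ;
}

void sketch_unlock(struct sketch_lock *l) { atomic_flag_clear(&l->held); }

/* Hypothetical analogue of rmap_walk_control: try_lock is an in-param,
 * contended is an out-param, so the struct cannot be const any more. */
struct sketch_walk_control {
    bool try_lock;   /* in: never wait for the lock */
    bool contended;  /* out: set when the trylock failed */
};

/* Returns true if the walk ran, false if it was skipped due to contention. */
bool sketch_walk(struct sketch_lock *lock, struct sketch_walk_control *ctl)
{
    assert(lock != NULL && ctl != NULL);
    ctl->contended = false;

    if (ctl->try_lock) {
        if (!sketch_trylock(lock)) {
            ctl->contended = true;
            return false;
        }
    } else {
        sketch_lock_acquire(lock);
    }

    /* ... the mappings would be visited here ... */

    sketch_unlock(lock);
    return true;
}

/* Ages n items and returns how many were skipped because their lock was
 * contended; in the commit those pages are rotated back to the lru head. */
size_t sketch_age_pages(struct sketch_lock *locks, size_t n)
{
    size_t rotated = 0;

    for (size_t i = 0; i < n; i++) {
        struct sketch_walk_control ctl = { .try_lock = true };

        if (!sketch_walk(&locks[i], &ctl) && ctl.contended)
            rotated++;
    }
    return rotated;
}
```

With three locks of which one is held elsewhere, sketch_age_pages() processes the other two and counts one rotation instead of blocking on the held lock.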
2022-05-19 | zswap: memcg accounting | Johannes Weiner
Applications can currently escape their cgroup memory containment when zswap is enabled. This patch adds per-cgroup tracking and limiting of zswap backend memory to rectify this. The existing cgroup2 memory.stat file is extended to show zswap statistics analogous to what's in meminfo and vmstat. Furthermore, two new control files, memory.zswap.current and memory.zswap.max, are added to allow tuning zswap usage on a per-workload basis. This is important since not all workloads benefit from zswap equally; some even suffer compared to disk swap when memory contents don't compress well. The optimal size of the zswap pool, and the threshold for writeback, also depend on the size of the workload's warm set. The implementation doesn't use a traditional page_counter transaction. zswap is unconventional as a memory consumer in that we only know the amount of memory to charge once expensive compression has occurred. If zswap is disabled or the limit is already exceeded we obviously don't want to compress page upon page only to reject them all. Instead, the limit is checked against current usage, then we compress and charge. This allows some limit overrun, but not enough to matter in practice. [hannes@cmpxchg.org: fix for CONFIG_SLOB builds] Link: https://lkml.kernel.org/r/YnwD14zxYjUJPc2w@cmpxchg.org [hannes@cmpxchg.org: opt out of cgroups v1] Link: https://lkml.kernel.org/r/Yn6it9mBYFA+/lTb@cmpxchg.org Link: https://lkml.kernel.org/r/20220510152847.230957-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
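The "check the limit, then compress, then charge" ordering described above can be sketched in user space as follows. Everything here is hypothetical (struct sketch_zgroup, sketch_compress(), and the choice of ">=" as the "already exceeded" test); it is not the kernel's accounting code, only an illustration of why a small overrun is possible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group accounting state; not the kernel's structures. */
struct sketch_zgroup {
    size_t usage;  /* bytes of compressed data currently charged */
    size_t limit;  /* configured maximum, in bytes */
};

/* Stand-in for an expensive compressor: pretends every page halves. */
size_t sketch_compress(size_t page_size)
{
    return page_size / 2;
}

/* Returns true if the page was stored, false if it was rejected up front. */
bool sketch_store_page(struct sketch_zgroup *g, size_t page_size)
{
    assert(g != NULL);

    /* 1. Cheap check against current usage, before any compression.
     *    Treating usage >= limit as "exceeded" is an assumption. */
    if (g->usage >= g->limit)
        return false;

    /* 2. Only now pay for compression; the size to charge is not
     *    known before this point. */
    size_t compressed = sketch_compress(page_size);

    /* 3. Charge the real size. This can push usage past the limit by
     *    up to one compressed page, which is the tolerated overrun. */
    g->usage += compressed;
    return true;
}
```

With a limit of 3000 bytes and 4096-byte pages, two stores succeed (usage ends at 4096, slightly over the limit) and the third is rejected without compressing anything.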
2022-05-19 | mm: zswap: add basic meminfo and vmstat coverage | Johannes Weiner
Currently it requires poking at debugfs to figure out the size and population of the zswap cache on a host. There are no counters for reads and writes against the cache. As a result, it's difficult to understand zswap behavior on production systems. Print zswap memory consumption and how many pages are zswapped out in /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat. Link: https://lkml.kernel.org/r/20220510152847.230957-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
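A consumer of the new counters could look roughly like the sketch below. The helper sketch_meminfo_kb() is hypothetical, and the field names "Zswap" and "Zswapped" are an assumption about how the entries are labelled in /proc/meminfo; the commit message itself does not spell them out. The function works on a text buffer so it can be exercised without a zswap-enabled kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Finds a line starting with "<key>:" in meminfo-style text and returns
 * its numeric value (meminfo reports kB), or -1 if the key is absent. */
long sketch_meminfo_kb(const char *text, const char *key)
{
    assert(text != NULL && key != NULL);

    size_t klen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        /* The ':' check keeps "Zswap" from matching "Zswapped". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long val;

            if (sscanf(line + klen + 1, "%ld", &val) == 1)
                return val;
            return -1;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}
```

In practice the buffer would come from reading /proc/meminfo with fopen() and fread(); the same approach applies to the zswap in/out event counters in /proc/vmstat, which use "name value" lines without a colon and would need a slightly different match.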
2022-05-19 | mm: Kconfig: simplify zswap configuration | Johannes Weiner
- CONFIG_ZRAM: Zram is a user-facing feature, whereas zsmalloc is not. Don't make the user chase down a technical dependency like that, just select it in automatically when zram is requested. The CONFIG_CRYPTO dependency is redundant due to more specific deps.
- CONFIG_ZPOOL: This is not a user-facing feature. Hide the symbol and have it selected in as needed.
- CONFIG_ZSWAP: Select CRYPTO instead of depend. Common pattern.
- Make the ZSWAP suboptions and their descriptions (compression, allocation backend) a bit more straight-forward for the user.

Link: https://lkml.kernel.org/r/20220510152847.230957-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm: Kconfig: group swap, slab, hotplug and thp options into submenus | Johannes Weiner
There are several clusters of related config options spread throughout the mostly flat MM submenu. Group them together and put specialization options into further subdirectories to make the MM submenu a bit more organized and easier to navigate. [hannes@cmpxchg.org: fix kbuild warnings] Link: https://lkml.kernel.org/r/YnvkSVivfnT57Vwh@cmpxchg.org [hannes@cmpxchg.org: fix more kbuild warnings] Link: https://lkml.kernel.org/r/Ynz8NusTdEGcCnJN@cmpxchg.org Link: https://lkml.kernel.org/r/20220510152847.230957-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm: Kconfig: move swap and slab config options to the MM section | Johannes Weiner
These are currently under General Setup. MM seems like a better fit. Link: https://lkml.kernel.org/r/20220510152847.230957-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | Documentation: filesystems: proc: update meminfo section | Johannes Weiner
Patch series "zswap: accounting & cgroup control", v2.

Zswap can consume nearly a quarter of RAM in the default configuration, yet it's neither listed in /proc/meminfo, nor is it accounted and manageable on a per-cgroup basis. This makes reasoning about the memory situation on a host in general rather difficult.

On shared/cgrouped hosts, the consequences are worse. First, workloads can escape memory containment and cause resource priority inversions: a lo-pri group can fill the global zswap pool and force a hi-pri group out to disk. Second, not all workloads benefit from zswap equally. Some even suffer when memory contents compress poorly, and are better off going to disk swap directly. On a host with mixed workloads, it's currently not possible to enable zswap for one workload but not for the other.

This series implements the missing global accounting as well as cgroup tracking & control for zswap backing memory:

- Patch 1 refreshes the very out-of-date meminfo documentation in Documentation/filesystems/proc.rst.
- Patches 2-4 clean up related and adjacent options in Kconfig. Not actual dependencies, just things I noticed during development.
- Patch 5 adds meminfo and vmstat coverage for zswap consumption and activity.
- Patch 6 implements per-cgroup tracking & control of zswap memory.

This patch (of 6): Add new entries. Minor corrections and cleanups.
[hannes@cmpxchg.org: fix htmldocs warnings] Link: https://lkml.kernel.org/r/Ynve8dg4zJyhH2gW@cmpxchg.org [hannes@cmpxchg.org: change `Unevictable' wording, per David] Link: https://lkml.kernel.org/r/YnwFraZlVWQoCjz3@cmpxchg.org Link: https://lkml.kernel.org/r/20220510152847.230957-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20220510152847.230957-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm/swap: fix comment about swap extent | Miaohe Lin
Since commit 4efaceb1c5f8 ("mm, swap: use rbtree for swap_extent"), rbtree is used for swap extent. Also curr_swap_extent is removed at that time. Update the corresponding comment. Link: https://lkml.kernel.org/r/20220509131416.17553-16-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm/swap: fix the comment of get_kernel_pages | Miaohe Lin
If no pages were pinned, 0 is returned in fact. Fix the corresponding comment. [akpm@linux-foundation.org: s/nr_pages/nr_segs/ also, per David, reflow comment] Link: https://lkml.kernel.org/r/20220509131416.17553-15-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm/swap: clean up the comment of find_next_to_unuse | Miaohe Lin
Since commit 10a9c496789f ("mm: simplify try_to_unuse"), frontswap parameter is removed. Update the corresponding comment. Link: https://lkml.kernel.org/r/20220509131416.17553-14-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm/swap: fix the obsolete comment for SWP_TYPE_SHIFT | Miaohe Lin
Since commit 3159f943aafd ("xarray: Replace exceptional entries"), only one bit of 'type' can be shifted up. Update the corresponding comment. Link: https://lkml.kernel.org/r/20220509131416.17553-13-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm/swap: add helper swap_offset_available() | Miaohe Lin
Add helper swap_offset_available() to remove some duplicated code. Minor readability improvement. [akpm@linux-foundation.org: s/swap_offset_available/swap_offset_available_and_locked/, per Neil] Link: https://lkml.kernel.org/r/20220509131416.17553-12-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm/swap: avoid calling swp_swap_info when try to check SWP_STABLE_WRITES | Miaohe Lin
Use flags of si directly to check SWP_STABLE_WRITES to avoid possible READ_ONCE and thus save some cpu cycles. [akpm@linux-foundation.org: use data_race() on si->flags, per Neil] Link: https://lkml.kernel.org/r/20220509131416.17553-10-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm/swap: make page_swapcount and __lru_add_drain_all static | Miaohe Lin
Make page_swapcount and __lru_add_drain_all static. They are only used within the file now. Link: https://lkml.kernel.org/r/20220509131416.17553-9-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm/swap: remove unneeded p != NULL check in __swap_duplicate | Miaohe Lin
If p is NULL, __swap_duplicate will already return -EINVAL. So if we reach here, p must be non-NULL. Link: https://lkml.kernel.org/r/20220509131416.17553-8-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm/swap: remove buggy cache->nr check in refill_swap_slots_cache | Miaohe Lin
refill_swap_slots_cache is always called when cache->nr is 0. So remove such buggy and confusing check. Link: https://lkml.kernel.org/r/20220509131416.17553-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm/swap: print bad swap offset entry in get_swap_device | Miaohe Lin
If offset exceeds the si->max, print bad swap offset entry to help debug the unexpected case. Link: https://lkml.kernel.org/r/20220509131416.17553-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm/swap: remove unneeded return value of free_swap_slot | Miaohe Lin
The return value of free_swap_slot is always 0 and also ignored now. Remove it to clean up the code. Link: https://lkml.kernel.org/r/20220509131416.17553-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm/swap: fold __swap_info_get() into its sole caller | Miaohe Lin
Fold __swap_info_get() into its sole caller to make code more clear. Minor readability improvement. Link: https://lkml.kernel.org/r/20220509131416.17553-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm/swap: use helper macro __ATTR_RW | Miaohe Lin
Use helper macro __ATTR_RW to define vma_ra_enabled_attr to make code more clear. Minor readability improvement. Link: https://lkml.kernel.org/r/20220509131416.17553-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
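For readers unfamiliar with the macro, the following user-space sketch mimics what an __ATTR_RW-style helper buys: from a single name it derives the attribute's string name, mode 0644, and the name_show/name_store callbacks by token pasting. The SKETCH_* macros, struct sketch_attr, and the callback signatures are hypothetical stand-ins, not the kernel's sysfs definitions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified attribute type (not the kernel's kobj_attribute). */
struct sketch_attr {
    const char *name;
    unsigned int mode;
    int (*show)(char *buf, size_t len);
    int (*store)(const char *buf);
};

/* Fully spelled-out form: every field is given by hand. */
#define SKETCH_ATTR(_name, _mode, _show, _store) \
    { .name = #_name, .mode = (_mode), .show = (_show), .store = (_store) }

/* Read-write shorthand: mode and callback names follow from _name alone. */
#define SKETCH_ATTR_RW(_name) \
    SKETCH_ATTR(_name, 0644, _name##_show, _name##_store)

int sketch_enabled = 1;

/* Writes "true\n" or "false\n" into buf; returns snprintf()'s result. */
int vma_ra_enabled_show(char *buf, size_t len)
{
    return snprintf(buf, len, "%s\n", sketch_enabled ? "true" : "false");
}

/* Accepts "true"/"1" or "false"/"0"; returns 0 on success, -1 otherwise. */
int vma_ra_enabled_store(const char *buf)
{
    assert(buf != NULL);

    if (strcmp(buf, "true") == 0 || strcmp(buf, "1") == 0)
        sketch_enabled = 1;
    else if (strcmp(buf, "false") == 0 || strcmp(buf, "0") == 0)
        sketch_enabled = 0;
    else
        return -1;
    return 0;
}

/* One line replaces the hand-written name, mode and callback wiring. */
struct sketch_attr vma_ra_enabled_attr = SKETCH_ATTR_RW(vma_ra_enabled);
```

The shorthand only works when the callbacks follow the name_show/name_store naming convention, which is why such a conversion is a pure readability change.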
2022-05-19 | mm/swap: use helper is_swap_pte() in swap_vma_readahead | Miaohe Lin
Patch series "A few cleanup patches for swap". This series contains a few patches to fix the comment, remove unneeded return value, use some helpers and so on. More details can be found in the respective changelogs. This patch (of 14): Use helper is_swap_pte() to check whether pte is swap entry to make code more clear. Minor readability improvement. Link: https://lkml.kernel.org/r/20220509131416.17553-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220509131416.17553-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Howells <dhowells@redhat.com> Cc: NeilBrown <neilb@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm: mmap: register suitable readonly file vmas for khugepaged | Yang Shi
The readonly FS THP relies on khugepaged to collapse THP for suitable vmas. But the behavior is inconsistent for "always" mode (https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/). The "always" mode means THP allocation should be tried all the time and khugepaged should try to collapse THP all the time. Of course the allocation and collapse may fail due to other factors and conditions. Currently file THP may not be collapsed by khugepaged even though all the conditions are met. That does break the semantics of "always" mode. So make sure readonly FS vmas are registered to khugepaged to fix the break. Register suitable vmas in common mmap path, that could cover both readonly FS vmas and shmem vmas, so remove the khugepaged calls in shmem.c. Still need to keep the khugepaged call in vma_merge() since vma_merge() is called in a lot of places, for example, madvise, mprotect, etc. Link: https://lkml.kernel.org/r/20220510203222.24246-9-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastmil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Song Liu <songliubraving@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Song Liu <song@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm: khugepaged: introduce khugepaged_enter_vma() helper | Yang Shi
The khugepaged_enter_vma_merge() actually does the same thing as the khugepaged_enter() section called by shmem_mmap(), so consolidate them into one helper and rename it to khugepaged_enter_vma(). Link: https://lkml.kernel.org/r/20220510203222.24246-8-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastmil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <song@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm: khugepaged: make hugepage_vma_check() non-static | Yang Shi
The hugepage_vma_check() could be reused by khugepaged_enter() and khugepaged_enter_vma_merge(), but it is static in khugepaged.c. Make it non-static and declare it in khugepaged.h. Link: https://lkml.kernel.org/r/20220510203222.24246-7-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <song@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm: khugepaged: make khugepaged_enter() void function | Yang Shi
Most callers of khugepaged_enter() don't care about the return value. Only dup_mmap(), anonymous THP page fault and MADV_HUGEPAGE handle the error by returning -ENOMEM. Actually it is not harmful for them to ignore the error case either. It also seems like overkill to fail fork() and page fault early due to a khugepaged_enter() error, and MADV_HUGEPAGE does set the VM_HUGEPAGE flag regardless of the error. Link: https://lkml.kernel.org/r/20220510203222.24246-6-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Vlastmil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <songliubraving@fb.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm: thp: only regular file could be THP eligible | Yang Shi
Since commit a4aeaa06d45e ("mm: khugepaged: skip huge page collapse for special files"), khugepaged only collapses THP for regular files, which is the intended usecase for readonly fs THP. Only show regular files as THP eligible accordingly. And make file_thp_enabled() available for khugepaged too in order to remove duplicate code. Link: https://lkml.kernel.org/r/20220510203222.24246-5-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Vlastmil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <songliubraving@fb.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm: khugepaged: skip DAX vma | Yang Shi
The DAX vma may be seen by khugepaged when the mm has other khugepaged suitable vmas. So khugepaged may try to collapse THP for DAX vma, but it will fail due to page sanity check, for example, page is not on LRU. So it is not harmful, but it is definitely pointless to run khugepaged against DAX vma, so skip it in early check. Link: https://lkml.kernel.org/r/20220510203222.24246-4-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Vlastmil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <songliubraving@fb.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | mm: khugepaged: remove redundant check for VM_NO_KHUGEPAGED | Yang Shi
The hugepage_vma_check() called by khugepaged_enter_vma_merge() does check VM_NO_KHUGEPAGED. Remove the check from caller and move the check in hugepage_vma_check() up. More checks may be run for VM_NO_KHUGEPAGED vmas, but MADV_HUGEPAGE is definitely not a hot path, so cleaner code does outweigh. Link: https://lkml.kernel.org/r/20220510203222.24246-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Vlastmil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <songliubraving@fb.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | sched: coredump.h: clarify the use of MMF_VM_HUGEPAGE | Yang Shi
Patch series "Make khugepaged collapse readonly FS THP more consistent", v4. The readonly FS THP relies on khugepaged to collapse THP for suitable vmas. But the behavior is inconsistent for "always" mode (https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/). The "always" mode means THP allocation should be tried all the time and khugepaged should try to collapse THP all the time. Of course the allocation and collapse may fail due to other factors and conditions. Currently file THP may not be collapsed by khugepaged even though all the conditions are met. That does break the semantics of "always" mode. So make sure readonly FS vmas are registered to khugepaged to fix the break. Register suitable vmas in common mmap path, that could cover both readonly FS vmas and shmem vmas, so remove the khugepaged calls in shmem.c. Patches 1-7 are minor bug fixes, cleanups and preparation patches. Patch 8 is the real meat. Tested with the khugepaged test in selftests and the testcase provided by Vlastimil Babka in https://lore.kernel.org/lkml/df3b5d1c-a36b-2c73-3e27-99e74983de3a@suse.cz/ by commenting out the MADV_HUGEPAGE call. This patch (of 8): MMF_VM_HUGEPAGE is set as long as the mm is available for khugepaged by khugepaged_enter(), not only when VM_HUGEPAGE is set on vma. Correct the comment to avoid confusion. Link: https://lkml.kernel.org/r/20220510203222.24246-1-shy828301@gmail.com Link: https://lkml.kernel.org/r/20220510203222.24246-2-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Vlastmil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <songliubraving@fb.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 | arm64/mm: fix page table check compile error for CONFIG_PGTABLE_LEVELS=2 | Tong Tiangen
If CONFIG_PGTABLE_LEVELS=2 and CONFIG_ARCH_SUPPORTS_PAGE_TABLE_CHECK=y, then we trigger a compile error: error: implicit declaration of function 'pte_user_accessible_page' Move the definition of page table check helper out of branch CONFIG_PGTABLE_LEVELS > 2 Link: https://lkml.kernel.org/r/20220517074548.2227779-3-tongtiangen@huawei.com Fixes: daf214c14dbe ("arm64/mm: enable ARCH_SUPPORTS_PAGE_TABLE_CHECK") Signed-off-by: Tong Tiangen <tongtiangen@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Guohanjun <guohanjun@huawei.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
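The class of build failure fixed above can be reproduced in miniature: a helper defined only inside an "#if LEVELS > 2" block disappears for two-level builds while generic code still calls it. The sketch below shows the layout after such a fix, with the shared helper outside the conditional. All SKETCH_* and sketch_* names are hypothetical, and the bit layout is invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the configured number of page table levels. */
#ifndef SKETCH_PGTABLE_LEVELS
#define SKETCH_PGTABLE_LEVELS 2
#endif

static_assert(SKETCH_PGTABLE_LEVELS >= 2, "at least two levels expected");

/* Invented entry bits, for illustration only. */
#define SKETCH_VALID 0x1u
#define SKETCH_USER  0x2u

/* Defined unconditionally because every configuration calls it. Had it
 * lived inside the "#if ... > 2" block below, a two-level build would
 * fail with an implicit declaration error. */
bool sketch_pte_user_accessible(unsigned int pte)
{
    return (pte & SKETCH_VALID) && (pte & SKETCH_USER);
}

#if SKETCH_PGTABLE_LEVELS > 2
/* Helpers that only make sense with a real middle level stay in here. */
bool sketch_pmd_user_accessible(unsigned int pmd)
{
    return (pmd & SKETCH_VALID) && (pmd & SKETCH_USER);
}
#endif

/* Generic checker compiled for all level counts. */
bool sketch_check_pte(unsigned int pte)
{
    return sketch_pte_user_accessible(pte);
}
```

Building with -DSKETCH_PGTABLE_LEVELS=3 additionally compiles the middle-level helper; both settings build because the generic caller only depends on the unconditional definition.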
2022-05-19 | riscv/mm: fix two page table check related issues | Tong Tiangen
Two page table check related issues are fixed here.
1. With CONFIG_PAGE_TABLE_CHECK enabled on riscv32, we get a compile error [1]: error: implicit declaration of function 'pud_leaf'. Add a pud_leaf() definition to include/asm-generic/pgtable-nopmd.h to fix this issue.
2. To stay consistent with the other pud_xxx() helpers, move pud_user() to pgtable-64.h and add pud_user() to pgtable-nopmd.h.

[1] https://lore.kernel.org/linux-mm/202205161811.2nLxmN2O-lkp@intel.com/T/

Link: https://lkml.kernel.org/r/20220517074548.2227779-2-tongtiangen@huawei.com Fixes: 856eed79f8d3 ("riscv/mm: enable ARCH_SUPPORTS_PAGE_TABLE_CHECK") Signed-off-by: Tong Tiangen <tongtiangen@huawei.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Guohanjun <guohanjun@huawei.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Will Deacon <will@kernel.org> Cc: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19dt-bindings: input: touchscreen: ilitek_ts_i2c: Absorb ili2xxx bindingsGeert Uytterhoeven
While Linux uses a different driver, the Ilitek ILI210x/ILI2117/ILI2120/ILI251x touchscreen controller Device Tree binding documentation is very similar. - Drop the fixed reg value, as some controllers use a different address, - Make reset-gpios optional, as it is not always wired. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/0c5f06c9d262c1720b40d068b6eefe58ca406601.1638539806.git.geert+renesas@glider.be
2022-05-19dt-bindings: timer: samsung,exynos4210-mct: define strict clock orderKrzysztof Kozlowski
The DTS should always have fixed clock order, even if it comes with clock-names property. Drop the pattern to make the order strict. Existing DTS already match this. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20220424150333.75172-3-krzysztof.kozlowski@linaro.org
2022-05-19dt-bindings: timer: samsung,exynos4210-mct: drop unneeded minItemsKrzysztof Kozlowski
There is no need to add minItems when it is equal to maxItems. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Acked-by: Rob Herring <robh@kernel.org> Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com> Signed-off-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20220424150333.75172-2-krzysztof.kozlowski@linaro.org
2022-05-19dt-bindings: timer: cdns,ttc: drop unneeded minItemsKrzysztof Kozlowski
There is no need to add minItems when it is equal to maxItems. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Acked-by: Rob Herring <robh@kernel.org> Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com> Signed-off-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20220424150333.75172-1-krzysztof.kozlowski@linaro.org
2022-05-19can: mcp251xfd: silence clang's -Wunaligned-access warningVincent Mailhol
clang emits a -Wunaligned-access warning on union mcp251xfd_tx_obj_load_buf. The reason is that field hw_tx_obj (not declared as packed) is being packed right after a 16-bit field inside a packed struct:

| union mcp251xfd_tx_obj_load_buf {
| 	struct __packed {
| 		struct mcp251xfd_buf_cmd cmd;
| 		/* ^ 16 bits fields */
| 		struct mcp251xfd_hw_tx_obj_raw hw_tx_obj;
| 		/* ^ not declared as packed */
| 	} nocrc;
| 	struct __packed {
| 		struct mcp251xfd_buf_cmd_crc cmd;
| 		struct mcp251xfd_hw_tx_obj_raw hw_tx_obj;
| 		__be16 crc;
| 	} crc;
| } ____cacheline_aligned;

Starting from LLVM 14, having an unpacked struct nested in a packed struct triggers a warning. c.f. [1].

This is a false positive because the field is always being accessed with the relevant put_unaligned_*() function. Adding __packed to the structure declaration silences the warning.

[1] https://github.com/llvm/llvm-project/issues/55520
Link: https://lore.kernel.org/all/20220518114357.55452-1-mailhol.vincent@wanadoo.fr Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Reported-by: kernel test robot <lkp@intel.com> Tested-by: Nathan Chancellor <nathan@kernel.org> # build Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
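The layout issue described above can be reproduced in user space. The following is a minimal sketch, not the driver's code: every `demo_*` name is hypothetical, `__attribute__((packed))` is the GCC/clang spelling assumed in place of the kernel's `__packed`, and the `memcpy()`-based helpers only stand in for the kernel's put_unaligned_*() family (they use native byte order rather than an explicit endianness).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 16-bit command header. */
struct demo_cmd {
	uint16_t cmd;
} __attribute__((packed));

/*
 * Hypothetical stand-in for the nested object. Declaring it packed
 * lowers its alignment requirement to 1, which is the kind of change
 * the commit message describes for silencing the warning.
 */
struct demo_obj_raw {
	uint32_t id;
	uint32_t flags;
} __attribute__((packed));

/* Packed outer struct: obj starts right after the 16-bit header. */
struct demo_load_buf {
	struct demo_cmd cmd;
	struct demo_obj_raw obj;
} __attribute__((packed));

/* Byte offset of the nested object inside the packed buffer. */
static size_t demo_obj_offset(void)
{
	return offsetof(struct demo_load_buf, obj);
}

/* Store a 32-bit value at a possibly unaligned address. */
static void demo_put_unaligned_u32(uint32_t val, void *p)
{
	memcpy(p, &val, sizeof(val));
}

/* Load a 32-bit value from a possibly unaligned address. */
static uint32_t demo_get_unaligned_u32(const void *p)
{
	uint32_t val;

	memcpy(&val, p, sizeof(val));
	return val;
}
```

With this layout the nested object sits at byte offset 2, so its 32-bit members are not naturally aligned. Going through byte-wise helpers, as the driver is said to do with put_unaligned_*(), is what keeps such accesses safe regardless of the warning.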
2022-05-19can: can-dev: remove obsolete CAN LED supportOliver Hartkopp
Since commit 30f3b42147ba6f ("can: mark led trigger as broken") the CAN specific LED support was disabled and marked as BROKEN. As the common LED support with CONFIG_LEDS_TRIGGER_NETDEV should do this work now the code can be removed as preparation for a CAN netdevice Kconfig rework. Link: https://lore.kernel.org/all/20220518154527.29046-1-socketcan@hartkopp.net Suggested-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> [mkl: remove led.h from MAINTAINERS] Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-05-19can: can-dev: move to netif_napi_add_weight()Jakub Kicinski
We want to remove the weight argument from the basic version of the netif_napi_add() call. Move all the callers in drivers/net/can that pass a custom weight (i.e. not NAPI_POLL_WEIGHT or 64) to the netif_napi_add_weight() API. Link: https://lore.kernel.org/all/20220517002345.1812104-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-05-19can: isotp: isotp_bind(): do not validate unused address informationOliver Hartkopp
With commit 2aa39889c463 ("can: isotp: isotp_bind(): return -EINVAL on incorrect CAN ID formatting") the bind() syscall returns -EINVAL when the given CAN ID needed to be sanitized. But in the case of an unconfirmed broadcast mode the rx CAN ID is not needed and may be uninitialized from the caller - which is ok. This patch makes sure the result of an improper CAN ID format is only provided when the address information is needed. Link: https://lore.kernel.org/all/20220517145653.2556-1-socketcan@hartkopp.net Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-05-19Merge tag 'wireless-next-2022-05-19' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v5.19 Second set of patches for v5.19 and most likely the last one. rtw89 got support for 8852ce devices and mt76 now supports Wireless Ethernet Dispatch. Major changes: cfg80211/mac80211 - support disabling EHT mode rtw89 - add support for Realtek 8852ce devices mt76 - Wireless Ethernet Dispatch support for flow offload - non-standard VHT MCS10-11 support - mt7921 AP mode support - mt7921 ipv6 NS offload support ath11k - enable keepalive during WoWLAN suspend - implement remain-on-channel support * tag 'wireless-next-2022-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (135 commits) iwlwifi: mei: fix potential NULL-ptr deref iwlwifi: mei: clear the sap data header before sending iwlwifi: mvm: remove vif_count iwlwifi: mvm: always tell the firmware to accept MCAST frames in BSS iwlwifi: mvm: add OTP info in case of init failure iwlwifi: mvm: fix assert 1F04 upon reconfig iwlwifi: fw: init SAR GEO table only if data is present iwlwifi: mvm: clean up authorized condition iwlwifi: mvm: use NULL instead of ERR_PTR when parsing wowlan status iwlwifi: pcie: simplify MSI-X cause mapping rtw89: pci: only mask out INT indicator register for disable interrupt v1 rtw89: convert rtw89_band to nl80211_band precisely rtw89: 8852c: update txpwr tables to HALRF_027_00_052 rtw89: cfo: check mac_id to avoid out-of-bounds rtw89: 8852c: set TX antenna path rtw89: add ieee80211::sta_rc_update ops wireless: Fix Makefile to be in alphabetical order mac80211: refactor freeing the next_beacon cfg80211: fix kernel-doc for cfg80211_beacon_data mac80211: minstrel_ht: support ieee80211_rate_status ... ==================== Link: https://lore.kernel.org/r/20220519153334.8D051C385AA@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19nvme: set non-mdts limits in nvme_scan_workChaitanya Kulkarni
In current implementation we set the non-mdts limits by calling nvme_init_non_mdts_limits() from nvme_init_ctrl_finish(). This also tries to set the limits for the discovery controller which has no I/O queues resulting in the warning message reported by the nvme_log_error() when running blktest nvme/002: - [ 2005.155946] run blktests nvme/002 at 2022-04-09 16:57:47 [ 2005.192223] loop: module loaded [ 2005.196429] nvmet: adding nsid 1 to subsystem blktests-subsystem-0 [ 2005.200334] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 <------------------------------SNIP----------------------------------> [ 2008.958108] nvmet: adding nsid 1 to subsystem blktests-subsystem-997 [ 2008.962082] nvmet: adding nsid 1 to subsystem blktests-subsystem-998 [ 2008.966102] nvmet: adding nsid 1 to subsystem blktests-subsystem-999 [ 2008.973132] nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN testhostnqn. *[ 2008.973196] nvme1: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE DNR* [ 2008.974595] nvme nvme1: new ctrl: "nqn.2014-08.org.nvmexpress.discovery" [ 2009.103248] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery" Move the call of nvme_init_non_mdts_limits() to nvme_scan_work() after we verify that I/O queues are created since that is a converging point for each transport where these limits are actually used. 1. FC : nvme_fc_create_association() ... nvme_fc_create_io_queues(ctrl); ... nvme_start_ctrl() nvme_scan_queue() nvme_scan_work() 2. PCIe:- nvme_reset_work() ... nvme_setup_io_queues() nvme_create_io_queues() nvme_alloc_queue() ... nvme_start_ctrl() nvme_scan_queue() nvme_scan_work() 3. RDMA :- nvme_rdma_setup_ctrl ... nvme_rdma_configure_io_queues ... nvme_start_ctrl() nvme_scan_queue() nvme_scan_work() 4. TCP :- nvme_tcp_setup_ctrl ... nvme_tcp_configure_io_queues ... nvme_start_ctrl() nvme_scan_queue() nvme_scan_work() * nvme_scan_work() ... 
nvme_validate_or_alloc_ns() nvme_alloc_ns() nvme_update_ns_info() nvme_update_disk_info() nvme_config_discard() <--- blk_queue_max_write_zeroes_sectors() <--- Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-05-19x86/PCI: Add kernel cmdline options to use/ignore E820 reserved regionsHans de Goede
Some firmware supplies PCI host bridge _CRS that includes address space unusable by PCI devices, e.g., space occupied by host bridge registers or used by hidden PCI devices. To avoid this unusable space, Linux currently excludes E820 reserved regions from _CRS windows; see 4dc2287c1805 ("x86: avoid E820 regions when allocating address space"). However, this use of E820 reserved regions to clip things out of _CRS is not supported by ACPI, UEFI, or PCI Firmware specs, and some systems have E820 reserved regions that cover the entire memory window from _CRS. 4dc2287c1805 clips the entire window, leaving no space for hot-added or uninitialized PCI devices. For example, from a Lenovo IdeaPad 3 15IIL 81WE: BIOS-e820: [mem 0x4bc50000-0xcfffffff] reserved pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window] pci 0000:00:15.0: BAR 0: [mem 0x00000000-0x00000fff 64bit] pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit] Future patches will add quirks to enable/disable E820 clipping automatically. Add a "pci=no_e820" kernel command line option to disable clipping with E820 reserved regions. Also add a matching "pci=use_e820" option to enable clipping with E820 reserved regions if that has been disabled by default by further patches in this patch-set. Both options taint the kernel because they are intended for debugging and workaround purposes until a quirk can set them automatically. [bhelgaas: commit log, add printk] Link: https://bugzilla.redhat.com/show_bug.cgi?id=1868899 Lenovo IdeaPad 3 Link: https://lore.kernel.org/r/20220519152150.6135-2-hdegoede@redhat.com Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Benoit Grégoire <benoitg@coeus.ca> Cc: Hui Wang <hui.wang@canonical.com>
2022-05-19RISC-V: Load purgatory in kexec_fileLi Zhengyu
This patch supports kexec_file to load and relocate purgatory. It works well on riscv64 QEMU, being tested with devmem. Signed-off-by: Li Zhengyu <lizhengyu3@huawei.com> Link: https://lore.kernel.org/r/20220408100914.150110-7-lizhengyu3@huawei.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-05-19RISC-V: Add purgatoryLi Zhengyu
This patch adds purgatory; the name and concept have been taken from kexec-tools. Purgatory runs between two kernels and verifies a sha256 hash to ensure the kernel to jump to is fine and has not been corrupted after loading. The Makefile is modified based on the x86 platform. Signed-off-by: Li Zhengyu <lizhengyu3@huawei.com> Link: https://lore.kernel.org/r/20220408100914.150110-6-lizhengyu3@huawei.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-05-19RISC-V: Support for kexec_file on panicLi Zhengyu
This patch adds support for loading a kexec on panic (kdump) kernel. It has been tested with vmcore-dmesg on riscv64 QEMU on both an smp and a non-smp system. Signed-off-by: Li Zhengyu <lizhengyu3@huawei.com> Link: https://lore.kernel.org/r/20220408100914.150110-5-lizhengyu3@huawei.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-05-19RISC-V: Add kexec_file supportLiao Chang
This patch adds support for kexec_file on RISC-V. I tested it on riscv64 QEMU with busybear-linux and a single core, along with the OpenSBI firmware fw_jump.bin for the generic platform. On an SMP system, it depends on CONFIG_{HOTPLUG_CPU, RISCV_SBI} to resume/stop harts through the OpenSBI firmware; it also needs an OpenSBI that supports the HSM extension. Signed-off-by: Liao Chang <liaochang1@huawei.com> Signed-off-by: Li Zhengyu <lizhengyu3@huawei.com> Link: https://lore.kernel.org/r/20220408100914.150110-4-lizhengyu3@huawei.com [Palmer: Make 64-bit only] Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-05-19RDMA/mlx5: Remove duplicate pointer assignment in mlx5_ib_alloc_implicit_mr()Daisuke Matsuda
The pointer imr->umem is assigned twice. Fix this by removing the redundant one. Link: https://lore.kernel.org/r/20220518044914.1903125-1-matsuda-daisuke@fujitsu.com Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-05-19RISC-V: use memcpy for kexec_file modeLiao Chang
The pointer to the buffer holding the kernel binaries is in kernel space for kexec_file mode. When copy_from_user copies data from a pointer to a block of memory, it checks that the pointer is in the user space range; on RISC-V that is: static inline bool __access_ok(unsigned long addr, unsigned long size) { return size <= TASK_SIZE && addr <= TASK_SIZE - size; } and TASK_SIZE is 0x4000000000 for 64-bit, which now causes copy_from_user to reject access to the field 'buf' of struct kexec_segment: an address in the range [CONFIG_PAGE_OFFSET - VMALLOC_SIZE, CONFIG_PAGE_OFFSET) is an invalid user space pointer. This patch fixes this issue by skipping access_ok() and using memcpy() instead. Signed-off-by: Liao Chang <liaochang1@huawei.com> Link: https://lore.kernel.org/r/20220408100914.150110-3-lizhengyu3@huawei.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
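The range check quoted in the message above can be exercised in isolation. This is a user-space sketch under stated assumptions: `demo_access_ok()` and `DEMO_TASK_SIZE` are hypothetical names that mirror the quoted `__access_ok()` and the 0x4000000000 value, `uint64_t` replaces `unsigned long` so the arithmetic is 64-bit on any host, and the high address used in the usage note is purely illustrative rather than a real RISC-V kernel address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value quoted in the commit message for 64-bit; a plain constant here. */
#define DEMO_TASK_SIZE UINT64_C(0x4000000000)

/*
 * Mirror of the quoted check: the access [addr, addr + size) is accepted
 * only if it lies entirely below DEMO_TASK_SIZE. Testing size first keeps
 * the subtraction from wrapping around.
 */
static bool demo_access_ok(uint64_t addr, uint64_t size)
{
	return size <= DEMO_TASK_SIZE && addr <= DEMO_TASK_SIZE - size;
}
```

For example, `demo_access_ok(0x10000, 4096)` passes, while an address in the upper half of the 64-bit space such as `UINT64_C(0xffffffc000000000)` is rejected for any size. That is the behaviour the message describes for a kernel-space `buf`, and why a plain memcpy() is used when the source is already a kernel pointer.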
2022-05-19kexec_file: Fix kexec_file.c build error for riscv platformLiao Chang
When CONFIG_KEXEC_FILE is set for the riscv platform, the compilation of kernel/kexec_file.c generates a build error:

kernel/kexec_file.c: In function 'crash_prepare_elf64_headers':
./arch/riscv/include/asm/page.h:110:71: error: request for member 'virt_addr' in something not a structure or union
110 | ((x) >= PAGE_OFFSET && (!IS_ENABLED(CONFIG_64BIT) || (x) < kernel_map.virt_addr))
| ^
./arch/riscv/include/asm/page.h:131:2: note: in expansion of macro 'is_linear_mapping'
131 | is_linear_mapping(_x) ? \
| ^~~~~~~~~~~~~~~~~
./arch/riscv/include/asm/page.h:140:31: note: in expansion of macro '__va_to_pa_nodebug'
140 | #define __phys_addr_symbol(x) __va_to_pa_nodebug(x)
| ^~~~~~~~~~~~~~~~~~
./arch/riscv/include/asm/page.h:143:24: note: in expansion of macro '__phys_addr_symbol'
143 | #define __pa_symbol(x) __phys_addr_symbol(RELOC_HIDE((unsigned long)(x), 0))
| ^~~~~~~~~~~~~~~~~~
kernel/kexec_file.c:1327:36: note: in expansion of macro '__pa_symbol'
1327 | phdr->p_offset = phdr->p_paddr = __pa_symbol(_text);

This occurs because the "kernel_map" referenced in macro is_linear_mapping() is supposed to be the struct kernel_mapping instance defined in arch/riscv/mm/init.c, but the 2nd argument of crash_prepare_elf64_headers() has the same symbol name, so in the expansion of macro is_linear_mapping() inside crash_prepare_elf64_headers(), "kernel_map" actually refers to the local variable.

Signed-off-by: Liao Chang <liaochang1@huawei.com> Link: https://lore.kernel.org/r/20220408100914.150110-2-lizhengyu3@huawei.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
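The name capture described above is a general property of C macros: an identifier inside a macro body is resolved at the point of expansion, so a parameter with the same name silently takes over. The sketch below illustrates that mechanism only. All `demo_*` and `DEMO_*` names are hypothetical, and the shadowing parameter is deliberately given the same struct type so that the example compiles; in the reported build failure the parameter was not a struct, which is why the compiler rejected the member access.

```c
/* Hypothetical stand-in for a mapping descriptor. */
struct demo_mapping {
	unsigned long virt_addr;
};

/* File-scope object that the macro author meant to reference. */
static struct demo_mapping demo_map = { .virt_addr = 0x1000 };

/* The macro names demo_map textually; the innermost declaration wins. */
#define DEMO_BELOW_MAP(x) ((x) < demo_map.virt_addr)

/* No local demo_map in scope: the macro sees the file-scope object. */
static int demo_check_global(unsigned long x)
{
	return DEMO_BELOW_MAP(x);
}

/*
 * The parameter shadows the file-scope object, so the very same macro
 * now reads the caller-supplied value instead.
 */
static int demo_check_shadowed(struct demo_mapping demo_map, unsigned long x)
{
	return DEMO_BELOW_MAP(x);
}
```

Calling `demo_check_global(0x800)` compares against the file-scope value 0x1000, whereas `demo_check_shadowed()` compares against whatever the caller passes in. One common remedy, offered here as a general suggestion rather than a statement of what this patch does, is to rename either the parameter or the object the macro depends on so the two can no longer collide.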
2022-05-19riscv: dts: sifive: fu540-c000: align dma node name with dtschemaKrzysztof Kozlowski
Fixes dtbs_check warnings like: dma@3000000: $nodename:0: 'dma@3000000' does not match '^dma-controller(@.*)?$' Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Link: https://lore.kernel.org/r/20220407193856.18223-1-krzysztof.kozlowski@linaro.org Fixes: c5ab54e9945b ("riscv: dts: add support for PDMA device of HiFive Unleashed Rev A00") Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>