path: root/include/linux
2021-06-29  mm: define default MAX_PTRS_PER_* in include/pgtable.h  (Daniel Axtens)
Commit c65e774fb3f6 ("x86/mm: Make PGDIR_SHIFT and PTRS_PER_P4D variable") made PTRS_PER_P4D variable on x86 and introduced MAX_PTRS_PER_P4D as a constant for cases which need a compile-time constant (e.g. fixed-size arrays). powerpc likewise has boot-time selectable MMU features which can cause other mm "constants" to vary. For KASAN, we have some static PTE/PMD/PUD/P4D arrays so we need compile-time maximums for all these constants. Extend the MAX_PTRS_PER_ idiom, and place default definitions in include/pgtable.h. These define MAX_PTRS_PER_x to be PTRS_PER_x unless an architecture has defined MAX_PTRS_PER_x in its arch headers. Clean up pgtable-nop4d.h and s390's MAX_PTRS_PER_P4D definitions while we're at it: both can just pick up the default now. Link: https://lkml.kernel.org/r/20210624034050.511391-4-dja@axtens.net Signed-off-by: Daniel Axtens <dja@axtens.net> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Marco Elver <elver@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
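For readers unfamiliar with the idiom, a minimal sketch of what such fallback definitions look like in a generic header, and how KASAN can then size its static arrays (illustrative excerpt, not the verbatim patch):

    #ifndef MAX_PTRS_PER_PTE
    #define MAX_PTRS_PER_PTE PTRS_PER_PTE
    #endif

    #ifndef MAX_PTRS_PER_P4D
    #define MAX_PTRS_PER_P4D PTRS_PER_P4D
    #endif

    /* an architecture with boot-time selectable MMU features can
     * override MAX_PTRS_PER_* in its arch headers before this point;
     * KASAN then sizes its static tables with the compile-time max: */
    static pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE];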
2021-06-29  kasan: test: improve failure message in KUNIT_EXPECT_KASAN_FAIL()  (David Gow)
The KUNIT_EXPECT_KASAN_FAIL() macro currently uses KUNIT_EXPECT_EQ() to compare fail_data.report_expected and fail_data.report_found. This always gave a somewhat useless error message on failure, but the addition of extra compile-time checking with READ_ONCE() has caused it to get much longer, and be truncated before anything useful is displayed. Instead, just check fail_data.report_found by hand (we've just set report_expected to 'true'), and print a better failure message with KUNIT_FAIL(). Because of this, report_expected is no longer used anywhere, and can be removed. Beforehand, a failure in: KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)area)[3100]); would have looked like: [22:00:34] [FAILED] vmalloc_oob [22:00:34] # vmalloc_oob: EXPECTATION FAILED at lib/test_kasan.c:991 [22:00:34] Expected ({ do { extern void __compiletime_assert_705(void) __attribute__((__error__("Unsupported access size for {READ,WRITE}_ONCE()."))); if (!((sizeof(fail_data.report_expected) == sizeof(char) || sizeof(fail_data.repp [22:00:34] not ok 45 - vmalloc_oob With this change, it instead looks like: [22:04:04] [FAILED] vmalloc_oob [22:04:04] # vmalloc_oob: EXPECTATION FAILED at lib/test_kasan.c:993 [22:04:04] KASAN failure expected in "((volatile char *)area)[3100]", but none occurred [22:04:04] not ok 45 - vmalloc_oob Also update the example failure in the documentation to reflect this. Link: https://lkml.kernel.org/r/20210606005531.165954-1-davidgow@google.com Signed-off-by: David Gow <davidgow@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Acked-by: Brendan Higgins <brendanhiggins@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Daniel Axtens <dja@axtens.net> Cc: David Gow <davidgow@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  printk: introduce dump_stack_lvl()  (Alexander Potapenko)
dump_stack() is used for many different cases, which may require a log level consistent with other kernel messages surrounding the dump_stack() call. Without that, certain systems that are configured to ignore the default level messages will miss stack traces in critical error reports. This patch introduces dump_stack_lvl() that behaves similarly to dump_stack(), but accepts a custom log level. The old dump_stack() becomes equal to dump_stack_lvl(KERN_DEFAULT). A somewhat similar patch has been proposed in 2012: https://lore.kernel.org/lkml/1332493269.2359.9.camel@hebo/ , but wasn't merged. [elver@google.com: add missing dump_stack_lvl() stub if CONFIG_PRINTK=n] Link: https://lkml.kernel.org/r/YJ0KAM0hQev1AmWe@elver.google.com Link: https://lkml.kernel.org/r/20210506105405.3535023-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: he, bo <bo.he@intel.com> Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com> Cc: Prasad Sodagudi <psodagud@quicinc.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
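As a quick illustration of the new interface (a sketch based on the description above, not the patch itself):

    /* emit a stack trace at the same level as the surrounding report */
    pr_err("something went badly wrong\n");
    dump_stack_lvl(KERN_ERR);

    /* the old entry point keeps its behaviour: */
    void dump_stack(void)
    {
    	dump_stack_lvl(KERN_DEFAULT);
    }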
2021-06-29  mm/page_alloc: add an alloc_pages_bulk_array_node() helper  (Uladzislau Rezki (Sony))
Patch series "vmalloc() vs bulk allocator", v2. This patch (of 3): Add a "node" variant of the alloc_pages_bulk_array() function. The helper guarantees that a __alloc_pages_bulk() is invoked with a valid NUMA node ID. Link: https://lkml.kernel.org/r/20210516202056.2120-1-urezki@gmail.com Link: https://lkml.kernel.org/r/20210516202056.2120-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm: add vma_lookup(), update find_vma_intersection() comments  (Liam Howlett)
Patch series "mm: Add vma_lookup()", v2. Many places in the kernel use find_vma() to get a vma and then check the start address of the vma to ensure the next vma was not returned. Other places use the find_vma_intersection() call with add, addr + 1 as the range; looking for just the vma at a specific address. The third use of find_vma() is by developers who do not know that the function starts searching at the provided address upwards for the next vma. This results in a bug that is often overlooked for a long time. Adding the new vma_lookup() function will allow for cleaner code by removing the find_vma() calls which check limits, making find_vma_intersection() calls of a single address to be shorter, and potentially reduce the incorrect uses of find_vma(). This patch (of 22): Many places in the kernel use find_vma() to get a vma and then check the start address of the vma to ensure the next vma was not returned. Other places use the find_vma_intersection() call with add, addr + 1 as the range; looking for just the vma at a specific address. The third use of find_vma() is by developers who do not know that the function starts searching at the provided address upwards for the next vma. This results in a bug that is often overlooked for a long time. Adding the new vma_lookup() function will allow for cleaner code by removing the find_vma() calls which check limits, making find_vma_intersection() calls of a single address to be shorter, and potentially reduce the incorrect uses of find_vma(). Also change find_vma_intersection() comments and declaration to be of the correct length and add kernel documentation style comment. Link: https://lkml.kernel.org/r/20210521174745.2219620-1-Liam.Howlett@Oracle.com Link: https://lkml.kernel.org/r/20210521174745.2219620-2-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm: ignore MAP_EXECUTABLE in ksys_mmap_pgoff()  (David Hildenbrand)
Let's also remove masking off MAP_EXECUTABLE from ksys_mmap_pgoff(): the last in-tree occurrence of MAP_EXECUTABLE is now in LEGACY_MAP_MASK, which accepts the flag e.g., for MAP_SHARED_VALIDATE; however, the flag is ignored throughout the kernel now. Add a comment to LEGACY_MAP_MASK stating that MAP_EXECUTABLE is ignored. Link: https://lkml.kernel.org/r/20210421093453.6904-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kevin Brodsky <Kevin.Brodsky@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm: memcontrol: remove trailing semicolon in macros  (Huilong Deng)
Macros should not use a trailing semicolon. Link: https://lkml.kernel.org/r/20210614091530.22117-1-denghuilong@cdjrlc.com Signed-off-by: Huilong Deng <denghuilong@cdjrlc.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
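For context, the classic reason trailing semicolons in macro bodies are harmful (a hypothetical macro, not from the patch): the baked-in ';' breaks if/else chains around call sites.

    #define LOG_STATE(s)	pr_info("state: %d\n", (s));	/* bad: trailing ';' */

    if (verbose)
    	LOG_STATE(state);	/* expands to "...;;" -- an extra empty statement */
    else			/* which terminates the if, so this else fails to parse */
    	return -EINVAL;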
2021-06-29  loop: charge i/o to mem and blk cg  (Dan Schatzberg)
The current code only associates with the existing blkcg when aio is used to access the backing file. This patch covers all types of i/o to the backing file and also associates the memcg so if the backing file is on tmpfs, memory is charged appropriately. This patch also exports cgroup_get_e_css and int_active_memcg so it can be used by the loop module. Link: https://lkml.kernel.org/r/20210610173944.1203706-4-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Chris Down <chris@chrisdown.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  memcontrol: use flexible-array member  (wenhuizhang)
Change deprecated zero-length and one-element arrays into a flexible array member. The zero-length and one-element arrays were detected by Lukas's CodeChecker; such arrays cause undefined behaviour if sizeof() is used on them. Link: https://lkml.kernel.org/r/20210518200910.29912-1-wenhui@gwmail.gwu.edu Signed-off-by: wenhuizhang <wenhui@gwmail.gwu.edu> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Alex Shi <alexs@kernel.org> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm: memcontrol: rename lruvec_holds_page_lru_lock to page_matches_lruvec  (Muchun Song)
lruvec_holds_page_lru_lock() doesn't check anything about locking and is used to check whether the page belongs to the lruvec. So rename it to page_matches_lruvec(). Link: https://lkml.kernel.org/r/20210417043538.9793-6-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm: memcontrol: simplify lruvec_holds_page_lru_lock  (Muchun Song)
We already have a helper lruvec_memcg() to get the memcg from lruvec, we do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat to simplify the code. We can have a single definition for this function that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and CONFIG_MEMCG. Link: https://lkml.kernel.org/r/20210417043538.9793-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec  (Muchun Song)
All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page) as the 2nd parameter to it (except isolate_migratepages_block()). But for isolate_migratepages_block(), the page_pgdat(page) is also equal to the local variable of @pgdat. So mem_cgroup_page_lruvec() does not need the pgdat parameter. Just remove it to simplify the code. Link: https://lkml.kernel.org/r/20210417043538.9793-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm: memcg/slab: create a new set of kmalloc-cg-<n> caches  (Waiman Long)
There are currently two problems in the way the objcg pointer array (memcg_data) in the page structure is being allocated and freed. On its allocation, it is possible that the allocated objcg pointer array comes from the same slab that requires memory accounting. If this happens, the slab will never become empty again as there is at least one object left (the obj_cgroup array) in the slab. When it is freed, the objcg pointer array object may be the last one in its slab and hence causes kfree() to be called again. With the right workload, the slab cache may be set up in a way that allows the recursive kfree() calling loop to nest deep enough to cause a kernel stack overflow and panic the system. One way to solve this problem is to split the kmalloc-<n> caches (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n> (KMALLOC_NORMAL) caches for unaccounted objects only and a new set of kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All the other caches can still allow a mix of accounted and unaccounted objects. With this change, all the objcg pointer array objects will come from KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So both the recursive kfree() problem and non-freeable slab problem are gone. Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have mixed accounted and unaccounted objects, this will slightly reduce the number of objcg pointer arrays that need to be allocated and save a bit of memory. On the other hand, creating a new set of kmalloc caches does have the effect of reducing cache utilization. So it is probably a wash. The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches() will include the newly added caches without change. [vbabka@suse.cz: don't create kmalloc-cg caches with cgroup.memory=nokmem] Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com [akpm@linux-foundation.org: un-fat-finger v5 delta creation] [longman@redhat.com: disable cache merging for KMALLOC_NORMAL caches] Link: https://lkml.kernel.org/r/20210505200610.13943-4-longman@redhat.com Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210505200610.13943-3-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> [longman@redhat.com: fix for CONFIG_ZONE_DMA=n] Suggested-by: Roman Gushchin <guro@fb.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm: free idle swap cache page after COW  (Huang Ying)
With commit 09854ba94c6a ("mm: do_wp_page() simplification"), after COW, the idle swap cache page (neither the page nor the corresponding swap entry is mapped by any process) will be left in the LRU list, even if it's in the active list or at the head of the inactive list. So the page reclaimer may take quite some overhead to reclaim these actually unused pages. To help page reclaim, this patch tries to free the idle swap cache page after COW. To avoid introducing much overhead to the hot COW code path, a) there's almost zero overhead for the non-swap case via checking PageSwapCache() first, and b) the page lock is acquired via trylock only. To test the patch, we used the pmbench memory accessing benchmark with a working set larger than the available memory on a 2-socket Intel server with an NVMe SSD as swap device. Test results show that the pmbench score increases up to 23.8% with the decreased size of swap cache and swapin throughput. Link: https://lkml.kernel.org/r/20210601053143.1380078-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> [use free_swap_cache()] Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  swap: fix do_swap_page() race with swapoff  (Miaohe Lin)
When I was investigating the swap code, I found the below possible race window:

    CPU 1                                             CPU 2
    -----                                             -----
    do_swap_page
      if (data_race(si->flags & SWP_SYNCHRONOUS_IO)
        swap_readpage
          if (data_race(sis->flags & SWP_FS_OPS)) {
                                                      swapoff
                                                        ..
                                                        p->swap_file = NULL;
                                                        ..
            struct file *swap_file = sis->swap_file;
            struct address_space *mapping = swap_file->f_mapping; [oops!]

Note that for the pages that are swapped in through the swap cache, this isn't an issue. Because the page is locked, and the swap entry will be marked with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been unlocked. Fix this race by using get/put_swap_device() to guard against concurrent swapoff. Link: https://lkml.kernel.org/r/20210426123316.806267-3-linmiaohe@huawei.com Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous device") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alex Shi <alexs@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
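The shape of the fix, sketched with the names used above (simplified; the real swapin path has more context around it):

    struct swap_info_struct *si;

    si = get_swap_device(entry);
    if (unlikely(!si))
    	goto out;	/* raced with swapoff: treat the entry as stale */

    /* sis->swap_file and sis->flags are now stable: swapoff()
     * cannot complete until the reference is dropped. */
    /* ... SWP_SYNCHRONOUS_IO swapin work ... */

    put_swap_device(si);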
2021-06-29  mm/swapfile: use percpu_ref to serialize against concurrent swapoff  (Miaohe Lin)
Patch series "close various race windows for swap", v6. When I was investigating the swap code, I found some possible race windows. This series aims to fix all these races. But using current get/put_swap_device() to guard against concurrent swapoff for swap_readpage() looks terrible because swap_readpage() may take really long time. And to reduce the performance overhead on the hot-path as much as possible, it appears we can use the percpu_ref to close this race window(as suggested by Huang, Ying). The patch 1 adds percpu_ref support for swap and most of the remaining patches try to use this to close various race windows. More details can be found in the respective changelogs. This patch (of 4): Using current get/put_swap_device() to guard against concurrent swapoff for some swap ops, e.g. swap_readpage(), looks terrible because they might take really long time. This patch adds the percpu_ref support to serialize against concurrent swapoff(as suggested by Huang, Ying). Also we remove the SWP_VALID flag because it's used together with RCU solution. Link: https://lkml.kernel.org/r/20210426123316.806267-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210426123316.806267-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alex Shi <alexs@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm: gup: pack has_pinned in MMF_HAS_PINNED  (Andrea Arcangeli)
has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop cleanup. Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast would reintroduce a loss of SMP scalability to pin-fast, so there's no future potential usefulness to keep an atomic in the mm for this. set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE (atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like atomic_set after this commit) has to be still issued only once per "mm", so the difference between the two will be lost in the noise. will-it-scale "mmap2" shows no change in performance with enterprise config as expected. will-it-scale "pin_fast" retains the > 4000% SMP scalability performance improvement against upstream as expected. This is a noop as far as overall performance and SMP scalability are concerned. [peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED] Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s [akpm@linux-foundation.org: coding style fixes] [peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments] Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
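The helper mentioned in the changelog, sketched consistently with the description above: set_bit() preceded by test_bit() so the shared mm cacheline is only dirtied once.

    static inline void mm_set_has_pinned_flag(unsigned long *mm_flags)
    {
    	/* avoid writing the shared cacheline when the flag is
    	 * already set, keeping pin-fast SMP-scalable */
    	if (!test_bit(MMF_HAS_PINNED, mm_flags))
    		set_bit(MMF_HAS_PINNED, mm_flags);
    }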
2021-06-29  mm: move page dirtying prototypes from mm.h  (Matthew Wilcox (Oracle))
These functions implement the address_space ->set_page_dirty operation and should live in pagemap.h, not mm.h so that the rest of the kernel doesn't get funny ideas about calling them directly. Link: https://lkml.kernel.org/r/20210615162342.1669332-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  fs: remove noop_set_page_dirty()  (Matthew Wilcox (Oracle))
Use __set_page_dirty_no_writeback() instead. This will set the dirty bit on the page, which will be used to avoid calling set_page_dirty() in the future. It will have no effect on actually writing the page back, as the pages are not on any LRU lists. [akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules] Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
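For affected filesystems the conversion is mechanical; a sketch with a hypothetical aops table (not taken from the patch):

    static const struct address_space_operations ex_aops = {
    	.readpage	= simple_readpage,
    	/* was: .set_page_dirty = noop_set_page_dirty, */
    	.set_page_dirty	= __set_page_dirty_no_writeback,
    };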
2021-06-29  iomap: use __set_page_dirty_nobuffers  (Matthew Wilcox (Oracle))
The only difference between iomap_set_page_dirty() and __set_page_dirty_nobuffers() is that the latter includes a debugging check that a !Uptodate page has private data. Link: https://lkml.kernel.org/r/20210615162342.1669332-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm/writeback: move __set_page_dirty() to core mm  (Matthew Wilcox (Oracle))
Patch series "Further set_page_dirty cleanups". Prompted by Christoph's recent patches, here are some more patches to improve the state of set_page_dirty(). They're all from the folio tree, so they've been tested to a certain extent. This patch (of 6): Nothing in __set_page_dirty() is specific to buffer_head, so move it to mm/page-writeback.c. That removes the only caller of account_page_dirtied() outside of page-writeback.c, so make it static. Link: https://lkml.kernel.org/r/20210615162342.1669332-1-willy@infradead.org Link: https://lkml.kernel.org/r/20210615162342.1669332-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  fs: move ramfs_aops to libfs  (Christoph Hellwig)
Move the ramfs aops to libfs and reuse them for kernfs and configfs. Those two did not wire up ->set_page_dirty before and now get __set_page_dirty_no_writeback, which is the right one for no-writeback address_space usage. Drop the now unused exports of the libfs helpers only used for ramfs-style pagecache usage. Link: https://lkml.kernel.org/r/20210614061512.3966143-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  writeback, cgroup: release dying cgwbs by switching attached inodes  (Roman Gushchin)
Asynchronously try to release dying cgwbs by switching attached inodes to the nearest living ancestor wb. It helps to get rid of per-cgroup writeback structures themselves and of pinned memory and block cgroups, which are significantly larger structures (mostly due to large per-cpu statistics data). This prevents memory waste and helps to avoid different scalability problems caused by large piles of dying cgroups. Reuse the existing mechanism of inode switching used for foreign inode detection. To speed things up, batch up to 115 inode switches in a single operation (the maximum number is selected so that the resulting struct inode_switch_wbs_context can fit into 1024 bytes). Because every switch consists of two steps divided by an RCU grace period, it would be too slow without batching. Please note that the whole batch counts as a single operation (when increasing/decreasing isw_nr_in_flight). This keeps umounting working (the switching queue is flushed), while preventing cleanups from consuming the whole switching quota and effectively blocking the frn switching. A cgwb cleanup operation can fail due to different reasons (e.g. not enough memory, the cgwb has an in-flight/pending io, an attached inode in a wrong state, etc). In this case the next scheduled cleanup will make a new attempt. An attempt is made each time a new cgwb is offlined (in other words a memcg and/or a blkcg is deleted by a user). In the future an additional attempt scheduled by a timer can be implemented. [guro@fb.com: replace open-coded "115" with arithmetic] Link: https://lkml.kernel.org/r/YMEcSBcq/VXMiPPO@carbon.dhcp.thefacebook.com [guro@fb.com: add smp_mb() to inode_prepare_wbs_switch()] Link: https://lkml.kernel.org/r/YMFa+guFw7OFjf3X@carbon.dhcp.thefacebook.com [willy@infradead.org: fix documentation] Link: https://lkml.kernel.org/r/20210615200242.1716568-2-willy@infradead.org Link: https://lkml.kernel.org/r/20210608230225.2078447-9-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
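The "115" above is derived rather than magic; a sketch of the arithmetic referenced in the changelog (the macro name here is an assumption for illustration):

    /* as many inode pointers as fit alongside the context header in
     * 1024 bytes; the flexible inodes[] array is NULL-terminated */
    #define WB_MAX_INODES_PER_ISW \
    	((1024UL - sizeof(struct inode_switch_wbs_context)) / \
    	 sizeof(struct inode *) + 1)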
2021-06-29  writeback, cgroup: support switching multiple inodes at once  (Roman Gushchin)
Currently only a single inode can be switched to another writeback structure at once. That means to switch an inode a separate inode_switch_wbs_context structure must be allocated, and a separate rcu callback and work must be scheduled. It's fine for the existing ad-hoc switching, which is not happening that often, but sub-optimal for massive switching required in order to release a writeback structure. To prepare for it, let's add support for switching multiple inodes at once. Instead of containing a single inode pointer, inode_switch_wbs_context will contain a NULL-terminated array of inode pointers. inode_do_switch_wbs() will be called for each inode. To optimize the locking, bdi->wb_switch_rwsem, old_wb's and new_wb's list_locks will be acquired and released only once altogether for all inodes. wb_wakeup() will also be called only once. Instead of calling wb_put(old_wb) after each successful switch, wb_put_many() is introduced and used. Link: https://lkml.kernel.org/r/20210608230225.2078447-8-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  writeback, cgroup: keep list of inodes attached to bdi_writeback  (Roman Gushchin)
Currently there is no way to iterate over inodes attached to a specific cgwb structure. It limits the ability to efficiently reclaim the writeback structure itself and associated memory and block cgroup structures without scanning all inodes belonging to a sb, which can be prohibitively expensive. While dirty/in-active-writeback an inode belongs to one of the bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time. Once cleaned up, it's removed from all io lists. So the inode->i_io_list can be reused to maintain the list of inodes, attached to a bdi_writeback structure. This patch introduces a new wb->b_attached list, which contains all inodes which were dirty at least once and are attached to the given cgwb. Inodes attached to the root bdi_writeback structures are never placed on such list. The following patch will use this list to try to release cgwbs structures more efficiently. Link: https://lkml.kernel.org/r/20210608230225.2078447-6-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm/page_reporting: allow driver to specify reporting order  (Gavin Shan)
The page reporting order (threshold) is sticky to @pageblock_order by default. Page reporting can then never be triggered, because no freed page ever comes up with a free area that huge. The situation becomes worse when the system memory becomes heavily fragmented. For example, the following configuration is used on ARM64 when the 64KB base page size is enabled. In this specific case, page reporting won't be triggered until a freed page comes up with a 512MB free area. That's hard to meet, especially when the system memory becomes heavily fragmented.

    PAGE_SIZE:       64KB
    HPAGE_SIZE:      512MB
    pageblock_order: 13 (512MB)
    MAX_ORDER:       14

This allows the drivers to specify the page reporting order when the page reporting device is registered. It falls back to @pageblock_order if it's not specified by the driver. The existing users (hv_balloon and virtio_balloon) don't specify it and @pageblock_order is still taken as their page reporting order. So this shouldn't introduce any functional changes. Link: https://lkml.kernel.org/r/20210625014710.42954-4-gshan@redhat.com Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
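Registration side, sketched from the description above (the callback name and the order value are illustrative, not from the patch):

    static struct page_reporting_dev_info ex_prdev = {
    	.report = ex_report,	/* driver's free-page report callback */
    	.order  = 5,		/* report free areas of 2^5 pages */
    };

    /* falls back to pageblock_order when .order is left unset */
    err = page_reporting_register(&ex_prdev);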
2021-06-29  slub: force on no_hash_pointers when slub_debug is enabled  (Stephen Boyd)
Obscuring the pointers that slub shows when debugging makes for some confusing slub debug messages: Padding overwritten. 0x0000000079f0674a-0x000000000d4dce17 Those addresses are hashed for kernel security reasons. If we're trying to be secure with slub_debug on the commandline we have some big problems given that we dump whole chunks of kernel memory to the kernel logs. Let's force on the no_hash_pointers commandline flag when slub_debug is on the commandline. This makes slub debug messages more meaningful and if by chance a kernel address is in some slub debug object dump we will have a better chance of figuring out what went wrong. Note that we don't use %px in the slub code because we want to reduce the number of places that %px is used in the kernel. This also nicely prints a big fat warning at kernel boot if slub_debug is on the commandline so that we know that this kernel shouldn't be used on production systems. [akpm@linux-foundation.org: fix build with CONFIG_SLUB_DEBUG=n] Link: https://lkml.kernel.org/r/20210601182202.3011020-5-swboyd@chromium.org Signed-off-by: Stephen Boyd <swboyd@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Petr Mladek <pmladek@suse.com> Cc: Joe Perches <joe@perches.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm, slub: change run-time assertion in kmalloc_index() to compile-time  (Hyeonggon Yoo)
Currently, when a size is not supported by kmalloc_index, the compiler will generate a run-time BUG(), while a compile-time error is also possible, and better. So change BUG to BUILD_BUG_ON_MSG to make a compile-time check possible. Also remove code that allocates more than 32MB because the current implementation supports only up to 32MB. [42.hyeyoo@gmail.com: fix support for clang 10] Link: https://lkml.kernel.org/r/20210518181247.GA10062@hyeyoo [vbabka@suse.cz: fix false-positive assert in kernel/bpf/local_storage.c] Link: https://lkml.kernel.org/r/bea97388-01df-8eac-091b-a3c89b4a4a09@suse.cz Link: https://lkml.kernel.org/r/20210511173448.GA54466@hyeyoo [elver@google.com: kfence fix] Link: https://lkml.kernel.org/r/20210512195227.245000695c9014242e9a00e5@linux-foundation.org Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Marco Elver <elver@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
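A sketch of the tail of the size ladder after the change (simplified; the real helper carries additional conditions):

    static __always_inline unsigned int __kmalloc_index(size_t size,
    						    bool size_is_constant)
    {
    	/* ... size-to-index ladder up to 32MB elided ... */

    	if (size_is_constant)
    		BUILD_BUG_ON_MSG(1, "unexpected size in kmalloc_index()");
    	else
    		BUG();	/* non-constant sizes still fail at run time */

    	return -1;	/* not reached; placates the compiler */
    }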
2021-06-29  kthread: switch to new kerneldoc syntax for named variable macro argument  (Jonathan Neuschäfer)
The syntax without dots is available since commit 43756e347f21 ("scripts/kernel-doc: Add support for named variable macro arguments"). The same HTML output is produced with and without this patch. Link: https://lkml.kernel.org/r/20210513161702.1721039-1-j.neuschaefer@gmx.net Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Cc: Jens Axboe <axboe@kernel.dk> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  mm/page_alloc: fix memory map initialization for descending nodes  (Mike Rapoport)
On systems with memory nodes sorted in descending order, for instance Dell Precision WorkStation T5500, the struct pages for higher PFNs and respectively lower nodes, could be overwritten by the initialization of struct pages corresponding to the holes in the memory sections. For example, for the below memory layout

    [ 0.245624] Early memory node ranges
    [ 0.248496]   node   1: [mem 0x0000000000001000-0x0000000000090fff]
    [ 0.251376]   node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
    [ 0.254256]   node   1: [mem 0x0000000100000000-0x0000001423ffffff]
    [ 0.257144]   node   0: [mem 0x0000001424000000-0x0000002023ffffff]

the range 0x1424000000 - 0x1428000000 in the beginning of node 0 starts in the middle of a section and will be considered as a hole during the initialization of the last section in node 1. The wrong initialization of the memory map causes panic on boot when CONFIG_DEBUG_VM is enabled. Reorder loop order of the memory map initialization so that the outer loop will always iterate over populated memory regions in the ascending order and the inner loop will select the zone corresponding to the PFN range. This way initialization of the struct pages for the memory holes will be always done for the ranges that are actually not populated. [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/YNXlMqBbL+tBG7yq@kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=213073 Link: https://lkml.kernel.org/r/20210624062305.10940-1-rppt@kernel.org Fixes: 0740a50b9baa ("mm/page_alloc.c: refactor initialization of struct page for holes in memory layout") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Boris Petkov <bp@alien8.de> Cc: Robert Shteynfeld <robert.shteynfeld@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29  Merge branches 'pm-domains' and 'pm-devfreq'  (Rafael J. Wysocki)
* pm-domains:
  PM: domains: Drop/restore performance state votes for devices at runtime PM
  PM: domains: Return early if perf state is already set for the device
  PM: domains: Split code in dev_pm_genpd_set_performance_state()
  PM: domains: fix some kernel-doc issues

* pm-devfreq:
  PM / devfreq: passive: Fix get_target_freq when not using required-opp
  dt-bindings: devfreq: tegra30-actmon: Add cooling-cells
  dt-bindings: devfreq: tegra30-actmon: Convert to schema
  PM / devfreq: userspace: Use DEVICE_ATTR_RW macro
  PM / devfreq: imx8m-ddrc: Remove DEVFREQ_GOV_SIMPLE_ONDEMAND dependency
  PM / devfreq: tegra30: Support thermal cooling
  PM / devfreq: imx-bus: Remove imx_bus_get_dev_status
  PM / devfreq: Add missing error code in devfreq_add_device()
2021-06-29  Merge branches 'pm-core' and 'pm-sleep'  (Rafael J. Wysocki)
* pm-core:
  PM: runtime: Clarify documentation when callbacks are unassigned
  PM: runtime: Allow unassigned ->runtime_suspend|resume callbacks
  PM: runtime: Improve path in rpm_idle() when no callback
  PM: runtime: document common mistake with pm_runtime_get_sync()

* pm-sleep:
  PM: hibernate: remove leading spaces before tabs
  PM: sleep: remove trailing spaces and tabs
  PM: hibernate: fix spelling mistakes
  PM: wakeirq: Set IRQF_NO_AUTOEN when requesting the IRQ
2021-06-29  Merge branches 'acpi-prm', 'acpi-sysfs' and 'acpi-x86'  (Rafael J. Wysocki)
* acpi-prm:
  ACPI: PRM: make symbol 'prm_module_list' static
  ACPI: Add \_SB._OSC bit for PRM
  ACPI: PRM: implement OperationRegion handler for the PlatformRtMechanism subtype

* acpi-sysfs:
  ACPI: sysfs: Remove tailing return statement in void function
  ACPI: sysfs: Use __ATTR_RO() and __ATTR_RW() macros
  ACPI: sysfs: Sort headers alphabetically
  ACPI: sysfs: Refactor param_get_trace_state() to drop dead code
  ACPI: sysfs: Unify pattern of memory allocations
  ACPI: sysfs: Allow bitmap list to be supplied to acpi_mask_gpe
  ACPI: sysfs: Make sparse happy about address space in use
  ACPI: sysfs: fix doc warnings in device_sysfs.c
  ACPI: sysfs: Drop four redundant return statements
  ACPI: sysfs: Fix a buffer overrun problem with description_show()

* acpi-x86:
  x86/acpi: Switch to pr_xxx log functions
2021-06-29  Merge branches 'acpi-bus', 'acpi-scan' and 'acpi-tables'  (Rafael J. Wysocki)
* acpi-bus:
  ACPI: Remove redundant clearing of context->ret.pointer from acpi_run_osc()

* acpi-scan:
  ACPI: scan: Simplify acpi_table_events_fn()
  ACPI: scan: Fix race related to dropping dependencies
  ACPI: scan: Reorganize acpi_device_add()
  ACPI: scan: Fix device object rescan in acpi_scan_clear_dep()
  ACPI: scan: Make acpi_walk_dep_device_list()
  ACPI: scan: Rearrange acpi_dev_get_first_consumer_dev_cb()
  ACPI: scan: Define acpi_bus_put_acpi_device() as static inline
  ACPI: scan: initialize local variable to avoid garbage being returned
  ACPI: scan: Add function to fetch dependent of ACPI device
  ACPI: scan: Extend acpi_walk_dep_device_list()
  ACPI: scan: Rearrange dep_unmet initialization

* acpi-tables:
  ACPI: tables: Add custom DSDT file as makefile prerequisite
  ACPI: bgrt: Use sysfs_emit
  ACPI: bgrt: Fix CFI violation
  ACPI: tables: FPDT: Add missing acpi_put_table() in acpi_init_fpdt()
  ACPI: tables: PPTT: Populate cache-id if provided by firmware
2021-06-29  Merge branch 'printk-rework' into for-linus  (Petr Mladek)
2021-06-29  Merge branch 'for-5.14-vsprintf-scanf' into for-linus  (Petr Mladek)
2021-06-28  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace  (Linus Torvalds)
Pull user namespace rlimit handling update from Eric Biederman:
 "This is the work mainly by Alexey Gladkov to limit rlimits to the rlimits of the user that created a user namespace, and to allow users to have stricter limits on the resources created within a user namespace."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  cred: add missing return error code when set_cred_ucounts() failed
  ucounts: Silence warning in dec_rlimit_ucounts
  ucounts: Set ucount_max to the largest positive value the type can hold
  kselftests: Add test to check for rlimit changes in different user namespaces
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_NPROC on top of ucounts
  Use atomic_t for ucounts reference counting
  Add a reference to ucounts for each cred
  Increase size of ucounts to atomic_long_t
2021-06-28  Merge tag 'pstore-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux  (Linus Torvalds)
Pull pstore updates from Kees Cook:
 "Use normal block device I/O path for pstore/blk. (Christoph Hellwig, Kees Cook, Pu Lehui)"

* tag 'pstore-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore/blk: Include zone in pstore_device_info
  pstore/blk: Fix kerndoc and redundancy on blkdev param
  pstore/blk: Use the normal block device I/O path
  pstore/blk: Move verify_size() macro out of function
  pstore/blk: Improve failure reporting
2021-06-28  Merge tag 'docs-5.14' of git://git.lwn.net/linux  (Linus Torvalds)
Pull documentation updates from Jonathan Corbet:
 "This was a reasonably active cycle for documentation; this includes:

  - Some kernel-doc cleanups. That script is still a regex onslaught from hell, but it has gotten a little better.
  - Improvements to the checkpatch docs, which are also used by the tool itself.
  - A major update to the pathname lookup documentation.
  - Elimination of :doc: markup, since our automarkup magic can create references from filenames without all the extra noise.
  - The flurry of Chinese translation activity continues.

  Plus, of course, the usual collection of updates, typo fixes, and warning fixes"

* tag 'docs-5.14' of git://git.lwn.net/linux: (115 commits)
  docs: path-lookup: use bare function() rather than literals
  docs: path-lookup: update symlink description
  docs: path-lookup: update get_link() ->follow_link description
  docs: path-lookup: update WALK_GET, WALK_PUT desc
  docs: path-lookup: no get_link()
  docs: path-lookup: update i_op->put_link and cookie description
  docs: path-lookup: i_op->follow_link replaced with i_op->get_link
  docs: path-lookup: Add macro name to symlink limit description
  docs: path-lookup: remove filename_mountpoint
  docs: path-lookup: update do_last() part
  docs: path-lookup: update path_mountpoint() part
  docs: path-lookup: update path_to_nameidata() part
  docs: path-lookup: update follow_managed() part
  docs: Makefile: Use CONFIG_SHELL not SHELL
  docs: Take a little noise out of the build process
  docs: x86: avoid using ReST :doc:`foo` markup
  docs: virt: kvm: s390-pv-boot.rst: avoid using ReST :doc:`foo` markup
  docs: userspace-api: landlock.rst: avoid using ReST :doc:`foo` markup
  docs: trace: ftrace.rst: avoid using ReST :doc:`foo` markup
  docs: trace: coresight: coresight.rst: avoid using ReST :doc:`foo` markup
  ...
2021-06-28  Merge tag 'integrity-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity  (Linus Torvalds)
Pull integrity subsystem updates from Mimi Zohar:
 "The large majority of the changes are EVM portable & immutable signature related: removing a dependency on loading an HMAC key, safely allowing file metadata included in the EVM portable & immutable signatures to be modified, allowing EVM signatures to fulfill IMA file signature policy requirements, including the EVM file metadata signature in lieu of an IMA file data signature in the measurement list, and adding dynamic debugging of EVM file metadata.

  In addition, in order to detect critical data or file change reversions, duplicate measurement records are permitted in the IMA measurement list. The remaining patches address compiler, sparse, and doc warnings"

* tag 'integrity-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity: (31 commits)
  evm: Check xattr size discrepancy between kernel and user
  evm: output EVM digest calculation info
  IMA: support for duplicate measurement records
  ima: Fix warning: no previous prototype for function 'ima_add_kexec_buffer'
  ima: differentiate between EVM failures in the audit log
  ima: Fix fall-through warning for Clang
  ima: Pass NULL instead of 0 to ima_get_action() in ima_file_mprotect()
  ima: Include header defining ima_post_key_create_or_update()
  ima/evm: Fix type mismatch
  ima: Set correct casting types
  doc: Fix warning in Documentation/security/IMA-templates.rst
  evm: Don't return an error in evm_write_xattrs() if audit is not enabled
  ima: Define new template evm-sig
  ima: Define new template fields xattrnames, xattrlengths and xattrvalues
  evm: Verify portable signatures against all protected xattrs
  ima: Define new template field imode
  ima: Define new template fields iuid and igid
  ima: Add ima_show_template_uint() template library function
  ima: Don't remove security.ima if file must not be appraised
  ima: Introduce template field evmsig and write to field sig as fallback
  ...
2021-06-28  Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6  (Linus Torvalds)
Pull crypto updates from Herbert Xu:
 "Algorithms:
   - Fix rmmod crash with x86/curve25519
   - Add ECDH NIST P384
   - Generate assembly files at build-time with perl scripts on arm
   - Switch to HMAC SHA512 DRBG as default DRBG

  Drivers:
   - Add sl3516 crypto engine
   - Add ECDH NIST P384 support in hisilicon/hpre
   - Add {ofb,cfb,ctr} over {aes,sm4} in hisilicon/sec
   - Add {ccm,gcm} over {aes,sm4} in hisilicon/sec
   - Enable omap hwrng driver for TI K3 family
   - Add support for AEAD algorithms in qce"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (142 commits)
  crypto: sl3516 - depends on HAS_IOMEM
  crypto: hisilicon/qm - implement for querying hardware tasks status.
  crypto: sl3516 - Fix build warning without CONFIG_PM
  MAINTAINERS: update caam crypto driver maintainers list
  crypto: nx - Fix numerous sparse byte-order warnings
  crypto: nx - Fix RCU warning in nx842_OF_upd_status
  crypto: api - Move crypto attr definitions out of crypto.h
  crypto: nx - Fix memcpy() over-reading in nonce
  crypto: hisilicon/sec - Fix spelling mistake "fallbcak" -> "fallback"
  crypto: sa2ul - Remove unused auth_len variable
  crypto: sl3516 - fix duplicated inclusion
  crypto: hisilicon/zip - adds the max shaper type rate
  crypto: hisilicon/hpre - adds the max shaper type rate
  crypto: hisilicon/sec - adds the max shaper type rate
  crypto: hisilicon/qm - supports to inquiry each function's QoS
  crypto: hisilicon/qm - add pf ping single vf function
  crypto: hisilicon/qm - merges the work initialization process into a single function
  crypto: hisilicon/qm - add the "alg_qos" file node
  crypto: hisilicon/qm - supports writing QoS int the host
  crypto: api - remove CRYPTOA_U32 and related functions
  ...
2021-06-28  once: implement DO_ONCE_LITE for non-fast-path "do once" functionality  (Tanner Love)
Certain uses of "do once" functionality reside outside of fast path, and so do not require jump label patching via static keys, making existing DO_ONCE undesirable in such cases. Replace uses of __section(".data.once") with DO_ONCE_LITE(_IF)? This patch changes the return values of xfs_printk_once, printk_once, and printk_deferred_once. Before, they returned whether the print was performed, but now, they always return true. This is okay because the return values of the following macros are entirely ignored throughout the kernel: - xfs_printk_once - xfs_warn_once - xfs_notice_once - xfs_info_once - printk_once - pr_emerg_once - pr_alert_once - pr_crit_once - pr_err_once - pr_warn_once - pr_notice_once - pr_info_once - pr_devel_once - pr_debug_once - printk_deferred_once - orc_warn Changes v3: - Expand commit message to explain why changing return values of xfs_printk_once, printk_once, printk_deferred_once is benign v2: - Fix i386 build warnings Signed-off-by: Tanner Love <tannerlove@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: David S. Miller <davem@davemloft.net>
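The resulting macro pair, sketched from the description (a static flag in .data.once, no static-key patching; not guaranteed to match the merged header byte for byte):

    #define DO_ONCE_LITE(func, ...)					\
    	DO_ONCE_LITE_IF(true, func, ##__VA_ARGS__)

    #define DO_ONCE_LITE_IF(condition, func, ...)			\
    	({								\
    		static bool __section(".data.once") __already_done;	\
    		bool __ret_do_once = !!(condition);			\
    									\
    		if (unlikely(__ret_do_once && !__already_done)) {	\
    			__already_done = true;				\
    			func(__VA_ARGS__);				\
    		}							\
    		unlikely(__ret_do_once);	/* always true for _LITE */ \
    	})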
2021-06-28  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds)
Pull kvm updates from Paolo Bonzini:
 "This covers all architectures (except MIPS) so I don't expect any other feature pull requests this merge window.

  ARM:
   - Add MTE support in guests, complete with tag save/restore interface
   - Reduce the impact of CMOs by moving them in the page-table code
   - Allow device block mappings at stage-2
   - Reduce the footprint of the vmemmap in protected mode
   - Support the vGIC on dumb systems such as the Apple M1
   - Add selftest infrastructure to support multiple configuration and apply that to PMU/non-PMU setups
   - Add selftests for the debug architecture
   - The usual crop of PMU fixes

  PPC:
   - Support for the H_RPT_INVALIDATE hypercall
   - Conversion of Book3S entry/exit to C
   - Bug fixes

  S390:
   - new HW facilities for guests
   - make inline assembly more robust with KASAN and co

  x86:
   - Allow userspace to handle emulation errors (unknown instructions)
   - Lazy allocation of the rmap (host physical -> guest physical address)
   - Support for virtualizing TSC scaling on VMX machines
   - Optimizations to avoid shattering huge pages at the beginning of live migration
   - Support for initializing the PDPTRs without loading them from memory
   - Many TLB flushing cleanups
   - Refuse to load if two-stage paging is available but NX is not (this has been a requirement in practice for over a year)
   - A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from CR0/CR4/EFER, using the MMU mode everywhere once it is computed from the CPU registers
   - Use PM notifier to notify the guest about host suspend or hibernate
   - Support for passing arguments to Hyper-V hypercalls using XMM registers
   - Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap on AMD processors
   - Hide Hyper-V hypercalls that are not included in the guest CPUID
   - Fixes for live migration of virtual machines that use the Hyper-V "enlightened VMCS" optimization of nested virtualization
   - Bugfixes (not many)

  Generic:
   - Support for retrieving statistics without debugfs
   - Cleanups for the KVM selftests API"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (314 commits)
  KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled
  kvm: x86: disable the narrow guest module parameter on unload
  selftests: kvm: Allows userspace to handle emulation errors.
  kvm: x86: Allow userspace to handle emulation errors
  KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP is on
  KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault
  KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault
  KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT
  KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic
  KVM: x86: Enhance comments for MMU roles and nested transition trickiness
  KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE
  KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMU
  KVM: x86/mmu: Use MMU's role to determine PTTYPE
  KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers
  KVM: x86/mmu: Add a helper to calculate root from role_regs
  KVM: x86/mmu: Add helper to update paging metadata
  KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0
  KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls
  KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper
  KVM: x86/mmu: Get nested MMU's root level from the MMU's role
  ...
2021-06-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-06-28

The following pull-request contains BPF updates for your *net-next* tree.

We've added 37 non-merge commits during the last 12 day(s) which contain a total of 56 files changed, 394 insertions(+), 380 deletions(-).

The main changes are:

1) XDP driver RCU cleanups, from Toke Høiland-Jørgensen and Paul E. McKenney.

2) Fix bpf_skb_change_proto() IPv4/v6 GSO handling, from Maciej Żenczykowski.

3) Fix false positive kmemleak report for BPF ringbuf alloc, from Rustam Kovhaev.

4) Fix x86 JIT's extable offset calculation for PROBE_LDX NULL, from Ravi Bangoria.

5) Enable libbpf fallback probing with tracing under RHEL7, from Jonathan Edwards.

6) Clean up x86 JIT to remove unused cnt tracking from EMIT macro, from Jiri Olsa.

7) Netlink cleanups for libbpf to please Coverity, from Kumar Kartikeya Dwivedi.

8) Allow to retrieve ancestor cgroup id in tracing programs, from Namhyung Kim.

9) Fix lirc BPF program query to use user-provided prog_cnt, from Sean Young.

10) Add initial libbpf doc including generated kdoc for its API, from Grant Seltzer.

11) Make xdp_rxq_info_unreg_mem_model() more robust, from Jakub Kicinski.

12) Fix up bpfilter startup log-level to info level, from Gary Lin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28Merge tag 'mlx5-updates-2021-06-26' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2021-06-26

This series provides small updates to the mlx5 driver:

1) Increase hairpin buffer size
2) Improve performance in SF allocation
3) Add IPsec support to uplink representor
4) Add stats for number of deleted kTLS TX offloaded connections
5) Add support for flow sampler in SW steering
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28net: bridge: allow the switchdev replay functions to be called for deletionVladimir Oltean
When a switchdev port leaves a LAG that is a bridge port, the switchdev objects and port attributes offloaded to that port are not removed:

  ip link add br0 type bridge
  ip link add bond0 type bond mode 802.3ad
  ip link set swp0 master bond0
  ip link set bond0 master br0
  bridge vlan add dev bond0 vid 100
  ip link set swp0 nomaster

VLAN 100 will remain installed on swp0 despite it going into standalone mode, because as far as the bridge is concerned, nothing ever happened to its bridge port.

Let's extend the bridge vlan, fdb and mdb replay functions to take a 'bool adding' argument, and make DSA and ocelot call the replay functions with 'adding' as false from the switchdev unsync path, for the switch port that leaves the bridge.

Note that this patch in itself does not salvage anything, because in the current pull mode of operation, DSA still needs to call the replay helpers with adding=false. This will be done in another patch.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
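To make the calling convention concrete, here is a hedged sketch of a replay helper driven by such a 'bool adding' flag. The names and fields are close to the switchdev API but simplified, so treat this as an illustration rather than the exact upstream code:

  /* Hedged sketch: replay one VLAN object to a driver's notifier block,
   * emitting an ADD or DEL switchdev notification based on 'adding'.
   * Assumes <net/switchdev.h> and <linux/notifier.h>.
   */
  static int vlan_replay_one(struct notifier_block *nb,
                             struct net_device *dev,
                             struct switchdev_obj_port_vlan *vlan,
                             bool adding,
                             struct netlink_ext_ack *extack)
  {
          struct switchdev_notifier_port_obj_info obj_info = {
                  .info = {
                          .dev = dev,
                          .extack = extack,
                  },
                  .obj = &vlan->obj,
          };
          unsigned long action = adding ? SWITCHDEV_PORT_OBJ_ADD :
                                          SWITCHDEV_PORT_OBJ_DEL;
          int err;

          err = nb->notifier_call(nb, action, &obj_info);
          return notifier_to_errno(err);
  }

On the unsync path, a driver leaving the bridge would run the same walk it performs at join time, but with adding=false, so every object it previously offloaded gets a matching deletion.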
2021-06-28net: bridge: constify variables in the replay helpersVladimir Oltean
Some of the arguments and local variables for the newly added switchdev replay helpers can be const, so let's make them so. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
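As a tiny, hypothetical before/after of the kind of change meant here (the function name is invented for illustration):

  /* Before: the prototype does not promise read-only access. */
  bool replay_port_is_bridged(struct net_device *dev);

  /* After: 'dev' is only read, so the type says so. */
  bool replay_port_is_bridged(const struct net_device *dev);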
2021-06-28net: bridge: ignore switchdev events for LAG ports which didn't request replayVladimir Oltean
There is a slight inconvenience in the switchdev replay helpers added recently, and this is when:

  ip link add br0 type bridge
  ip link add bond0 type bond
  ip link set bond0 master br0
  bridge vlan add dev bond0 vid 100
  ip link set swp0 master bond0
  ip link set swp1 master bond0

Since the underlying driver (currently only DSA) asks for a replay of VLANs when swp0 and swp1 join the LAG because it is bridged, what will happen is that DSA will try to react twice on the VLAN event for swp0. This is not really a huge problem right now, because most drivers accept duplicates since the bridge itself does, but it will become a problem when we add support for replaying switchdev object deletions.

Let's fix this by adding a blank void *ctx in the replay helpers, which will be passed on by the bridge in the switchdev notifications. If the context is NULL, everything is the same as before. But if the context is populated with a valid pointer, the underlying switchdev driver (currently DSA) can use the pointer to 'see through' the bridge port (which in the example above is bond0) and 'know' that the event is only for a particular physical port offloading that bridge port, and not for all of them.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
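A hedged sketch of the filtering this enables on the driver side; 'example_port' and the helper name are invented for illustration and are not taken from the DSA code:

  /* Hedged sketch: decide whether a replayed switchdev event applies
   * to this physical port. A NULL ctx preserves the old behaviour
   * (the event applies to every port offloading the bridge port); a
   * non-NULL ctx targets exactly one port behind the LAG.
   */
  struct example_port {
          struct net_device *netdev;
          /* ... driver-private state ... */
  };

  static bool example_port_event_applies(const struct example_port *port,
                                         const void *ctx)
  {
          return !ctx || ctx == port;
  }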
2021-06-28Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "There's a reasonable amount here and the juicy details are all below.

  It's worth noting that the MTE/KASAN changes strayed outside of our usual directories due to core mm changes and some associated changes to some other architectures; Andrew asked for us to carry these [1] rather than take them via the -mm tree.

  Summary:

   - Optimise SVE switching for CPUs with 128-bit implementations.
   - Fix output format from SVE selftest.
   - Add support for versions v1.2 and 1.3 of the SMC calling convention.
   - Allow Pointer Authentication to be configured independently for kernel and userspace.
   - PMU driver cleanups for managing IRQ affinity and exposing event attributes via sysfs.
   - KASAN optimisations for both hardware tagging (MTE) and out-of-line software tagging implementations.
   - Relax frame record alignment requirements to facilitate 8-byte alignment with KASAN and Clang.
   - Cleanup of page-table definitions and removal of unused memory types.
   - Reduction of ARCH_DMA_MINALIGN back to 64 bytes.
   - Refactoring of our instruction decoding routines and addition of some missing encodings.
   - Move entry code into C and harden it against harmful compiler instrumentation.
   - Update booting requirements for the FEAT_HCX feature, added to v8.7 of the architecture.
   - Fix resume from idle when pNMI is being used.
   - Additional CPU sanity checks for MTE and preparatory changes for systems where not all of the CPUs support 32-bit EL0.
   - Update our kernel string routines to the latest Cortex Strings implementation.
   - Big cleanup of our cache maintenance routines, which were confusingly named and inconsistent in their implementations.
   - Tweak linker flags so that GDB can understand vmlinux when using RELR relocations.
   - Boot path cleanups to enable early initialisation of per-cpu operations needed by KCSAN.
   - Non-critical fixes and miscellaneous cleanup"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (150 commits)
  arm64: tlb: fix the TTL value of tlb_get_level
  arm64: Restrict undef hook for cpufeature registers
  arm64/mm: Rename ARM64_SWAPPER_USES_SECTION_MAPS
  arm64: insn: avoid circular include dependency
  arm64: smp: Bump debugging information print down to KERN_DEBUG
  drivers/perf: fix the missed ida_simple_remove() in ddr_perf_probe()
  perf/arm-cmn: Fix invalid pointer when access dtc object sharing the same IRQ number
  arm64: suspend: Use cpuidle context helpers in cpu_suspend()
  PSCI: Use cpuidle context helpers in psci_cpu_suspend_enter()
  arm64: Convert cpu_do_idle() to using cpuidle context helpers
  arm64: Add cpuidle context save/restore helpers
  arm64: head: fix code comments in set_cpu_boot_mode_flag
  arm64: mm: drop unused __pa(__idmap_text_start)
  arm64: mm: fix the count comments in compute_indices
  arm64/mm: Fix ttbr0 values stored in struct thread_info for software-pan
  arm64: mm: Pass original fault address to handle_mm_fault()
  arm64/mm: Drop SECTION_[SHIFT|SIZE|MASK]
  arm64/mm: Use CONT_PMD_SHIFT for ARM64_MEMSTART_SHIFT
  arm64/mm: Drop SWAPPER_INIT_MAP_SIZE
  arm64: Conditionally configure PTR_AUTH key of the kernel.
  ...
2021-06-28Merge tag 'mac80211-next-for-net-next-2021-06-25' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Lots of changes:

 * aggregation handling improvements for some drivers
 * hidden AP discovery on 6 GHz and other HE 6 GHz improvements
 * minstrel improvements for no-ack frames
 * deferred rate control for TXQs to improve reaction times
 * virtual time-based airtime scheduler
 * along with various little cleanups/fixups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>