|
Generalise vmemmap_populate_hugepages() so that ARM64, X86 and LoongArch can
share its implementation.
Link: https://lkml.kernel.org/r/20221027125253.3458989-4-chenhuacai@loongson.cn
Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Min Zhou <zhoumin@loongson.cn>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Xuefeng Li <lixuefeng@loongson.cn>
Cc: Xuerui Wang <kernel@xen0n.name>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add sparse memory vmemmap support for LoongArch. SPARSEMEM_VMEMMAP uses a
virtually mapped memmap to optimise pfn_to_page and page_to_pfn
operations. This is the most efficient option when sufficient kernel
resources are available.
Link: https://lkml.kernel.org/r/20221027125253.3458989-3-chenhuacai@loongson.cn
Signed-off-by: Min Zhou <zhoumin@loongson.cn>
Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Will Deacon <will@kernel.org>
Cc: Xuefeng Li <lixuefeng@loongson.cn>
Cc: Xuerui Wang <kernel@xen0n.name>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/sparse-vmemmap: Generalise helpers and enable for
LoongArch", v14.
This series enables sparse-vmemmap for LoongArch. But LoongArch cannot use
the generic helpers directly because MIPS and LoongArch need to call
pgd_init()/pud_init()/pmd_init() when populating page tables. So we adjust
the prototypes of p?d_init() so that the generic helpers can call them, then
enable sparse-vmemmap with the generic helpers, and further generalise
vmemmap_populate_hugepages() for ARM64, X86 and LoongArch.
This patch (of 4):
We are preparing to add sparse vmemmap support to LoongArch. MIPS and
LoongArch need to call pgd_init()/pud_init()/pmd_init() when populating
page tables, so adjust their prototypes so that the generic helpers can
call them.
NIOS2 declares pmd_init() but doesn't use it, so just remove it to avoid
build errors.
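For illustration, the kind of pattern the common prototype enables is a weak
no-op default in generic code that architectures override; the exact bodies
below are an assumption, not the literal diff:
  /* old, arch-specific shapes (varied per architecture), e.g.:
   *   void pgd_init(unsigned long page);
   *   void pmd_init(unsigned long addr, unsigned long pagetable);
   * new, common shape callable from generic code:
   */
  void pgd_init(void *addr);
  void pud_init(void *addr);
  void pmd_init(void *addr);

  /* generic code can then supply weak no-op defaults and call the hook
   * while populating intermediate tables */
  void __weak __meminit pmd_init(void *addr)
  {
  }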
Link: https://lkml.kernel.org/r/20221027125253.3458989-1-chenhuacai@loongson.cn
Link: https://lkml.kernel.org/r/20221027125253.3458989-2-chenhuacai@loongson.cn
Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Will Deacon <will@kernel.org>
Cc: Xuefeng Li <lixuefeng@loongson.cn>
Cc: Xuerui Wang <kernel@xen0n.name>
Cc: Min Zhou <zhoumin@loongson.cn>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In certain cases (e.g. when handling a softirq),
__msan_instrument_asm_store(&var, sizeof(var)) may be called from
within the KMSAN runtime, but later the value of @var is used with
!kmsan_in_runtime(), leading to false positives.
Because kmsan_internal_unpoison_memory() doesn't take locks, it should be
fine to call it without kmsan_in_runtime() checks, which fixes the
mentioned false positives.
Link: https://lkml.kernel.org/r/20221128094541.2645890-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Lockdep and KMSAN used to play badly together, causing deadlocks when
KMSAN instrumentation of lockdep.c called lockdep functions recursively.
It looks like this is no longer the case, and a kernel can run (albeit more
slowly) with both KMSAN and lockdep enabled. This patch should fix false
positives on wq_head->lock->dep_map, which KMSAN used to consider
uninitialized because of lockdep.c not being instrumented.
Link: https://lore.kernel.org/lkml/Y3b9AAEKp2Vr3e6O@sol.localdomain/
Link: https://lkml.kernel.org/r/20221128094541.2645890-1-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Reported-by: Eric Biggers <ebiggers@kernel.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Return the value from ptep_get_and_clear_full() directly instead of
storing it in another, redundant variable.
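For illustration only (the caller shown here is hypothetical):
  /* before: value parked in a redundant local */
  pte_t pte = ptep_get_and_clear_full(mm, addr, ptep, full);
  return pte;

  /* after: return the value directly */
  return ptep_get_and_clear_full(mm, addr, ptep, full);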
Link: https://lkml.kernel.org/r/202211282107437343474@zte.com.cn
Signed-off-by: zhang songyi <zhang.songyi@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
generic_fadvise() sets endbyte = -1 to specify end of file (i.e. if
length == 0 is passed from userspace). Most other callers to
filemap_fdatawrite_range() use LLONG_MAX for this purpose, particularly if
they also call fdatawait_range() (which requires end >= start). For
example, sync_file_range(), vfs_fsync() (where the range is passed down
through per-fs ->fsync() callbacks), filemap_flush(), etc.
generic_fadvise() does not currently wait on writeback, but fix the call
up to be consistent with other callers.
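A rough sketch of the convention being aligned with (variable handling is
illustrative, not the literal diff):
  loff_t endbyte;

  if (!len)
          endbyte = LLONG_MAX;    /* "to end of file", as sync_file_range() does */
  else
          endbyte = offset + len - 1;

  filemap_fdatawrite_range(mapping, offset, endbyte);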
Link: https://lkml.kernel.org/r/20221128155632.3950447-3-bfoster@redhat.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "filemap: skip write and wait if end offset precedes start",
v2.
A fix for the odd write and wait behavior described in the patch 1 commit
log. Technically patch 1 could simply remove the check rather than lift
it into the callers, but this seemed a bit more user friendly to me.
Patch 2 is appended after observation that fadvise() interacted poorly
with the v1 patch. This is no longer a problem with v2, making patch 2
purely a cleanup.
This series survived both fstests and ltp regression runs without
observable problems. I had (end < start) warning checks in each relevant
function, with fadvise() being the only caller that triggered them. That
said, I dropped the warnings after testing because there seemed to be too
much potential for noise from the various other callers.
This patch (of 2):
A call to file[map]_write_and_wait_range() with an end offset that
precedes the start offset but happens to land in the same page can trigger
writeback submission but fails to wait on the submitted page. Writeback
submission occurs because __filemap_fdatawrite_range() passes both offsets
down into write_cache_pages(), which rounds down to page indexes before it
starts processing writeback. However, __filemap_fdatawait_range()
immediately returns if the byte-granular end offset precedes the start
offset.
This behavior was observed in the form of unpredictable latency from a
frequent write and wait call with incorrect parameters. The behavior gave
the impression that the fdatawait path might occasionally fail to wait on
writeback, but further investigation showed the latency was from
write_cache_pages() waiting on writeback state to clear for a page already
under writeback. Therefore, this indicated that fdatawait actually never
waits on writeback in this particular situation.
The byte granular check in __filemap_fdatawait_range() goes all the way
back to the old wait_on_page_writeback() helper. It originally used page
offsets and so would have waited in this problematic case. That changed
to byte granularity file offsets in commit 94004ed726f3 ("kill
wait_on_page_writeback_range"), which subtly changed this behavior. The
check itself has become somewhat redundant since the error checking code
that used to follow the wait loop (at the time of the aforementioned
commit) has now been removed and lifted into the higher level callers.
Therefore, we can restore historical fdatawait behavior by simply removing
the check. Since the current fdatawait behavior has been in place for
quite some time and is consistent with other interfaces that use file
offsets, instead lift the check into the file[map]_write_and_wait_range()
helpers to provide consistent behavior between the write and wait.
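A minimal sketch of where the lifted check lands, assuming the early return
sits at the top of the write-and-wait helper:
  int filemap_write_and_wait_range(struct address_space *mapping,
                                   loff_t lstart, loff_t lend)
  {
          if (lend < lstart)
                  return 0;       /* nothing to write or wait for */

          /* ... existing fdatawrite + fdatawait logic unchanged ... */
  }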
Link: https://lkml.kernel.org/r/20221128155632.3950447-1-bfoster@redhat.com
Link: https://lkml.kernel.org/r/20221128155632.3950447-2-bfoster@redhat.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This commit adds the writeback mechanism for zsmalloc, analogous to the
zbud allocator. Zsmalloc will attempt to determine the coldest zspage
(i.e. the least recently used) in the pool, and attempt to write back all the
stored compressed objects via the pool's evict handler.
Link: https://lkml.kernel.org/r/20221128191616.1261026-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This adds a new field to zs_pool to store evict handlers for writeback,
analogous to the zbud allocator.
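Loosely, modelled on zbud (the names and layout below are assumptions, not
the exact structures):
  struct zs_pool_ops {
          int (*evict)(struct zs_pool *pool, unsigned long handle);
  };

  struct zs_pool {
          /* ... */
          const struct zs_pool_ops *ops;  /* writeback/evict callbacks */
  };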
Link: https://lkml.kernel.org/r/20221128191616.1261026-6-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This helps determine the coldest zspages as candidates for writeback.
Link: https://lkml.kernel.org/r/20221128191616.1261026-5-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, zsmalloc has a hierarchy of locks, which includes a pool-level
migrate_lock, and a lock for each size class. We have to obtain both
locks in the hotpath in most cases anyway, except for zs_malloc. This
exception will no longer exist when we introduce an LRU into the zs_pool
for the new writeback functionality - we will need to obtain a pool-level
lock to synchronize LRU handling even in zs_malloc.
In preparation for zsmalloc writeback, consolidate these locks into a
single pool-level lock, which drastically reduces the complexity of
synchronization in zsmalloc.
We have also benchmarked the lock consolidation to see the performance
effect of this change on zram.
First, we ran a synthetic FS workload on a server machine with 36 cores
(same machine for all runs), using
fs_mark -d ../zram1mnt -s 100000 -n 2500 -t 32 -k
before and after for btrfs and ext4 on zram (FS usage is 80%).
Here is the result (unit is file/second):
With lock consolidation (btrfs):
Average: 13520.2, Median: 13531.0, Stddev: 137.5961482019028
Without lock consolidation (btrfs):
Average: 13487.2, Median: 13575.0, Stddev: 309.08283679298665
With lock consolidation (ext4):
Average: 16824.4, Median: 16839.0, Stddev: 89.97388510006668
Without lock consolidation (ext4):
Average: 16958.0, Median: 16986.0, Stddev: 194.7370021336469
As you can see, we observe a 0.3% regression for btrfs, and a 0.9%
regression for ext4. This is a small, barely measurable difference in my
opinion.
For a more realistic scenario, we also tried building the kernel on zram.
Here is the time it takes (in seconds):
With lock consolidation (btrfs):
real
Average: 319.6, Median: 320.0, Stddev: 0.8944271909999159
user
Average: 6894.2, Median: 6895.0, Stddev: 25.528415540334656
sys
Average: 521.4, Median: 522.0, Stddev: 1.51657508881031
Without lock consolidation (btrfs):
real
Average: 319.8, Median: 320.0, Stddev: 0.8366600265340756
user
Average: 6896.6, Median: 6899.0, Stddev: 16.04057355583023
sys
Average: 520.6, Median: 521.0, Stddev: 1.140175425099138
With lock consolidation (ext4):
real
Average: 320.0, Median: 319.0, Stddev: 1.4142135623730951
user
Average: 6896.8, Median: 6878.0, Stddev: 28.621670111997307
sys
Average: 521.2, Median: 521.0, Stddev: 1.7888543819998317
Without lock consolidation (ext4):
real
Average: 319.6, Median: 319.0, Stddev: 0.8944271909999159
user
Average: 6886.2, Median: 6887.0, Stddev: 16.93221781102523
sys
Average: 520.4, Median: 520.0, Stddev: 1.140175425099138
The difference is entirely within the noise of a typical run on zram.
This hardly justifies the complexity of maintaining both the pool lock and
the class lock. In fact, for writeback, we would need to introduce yet
another lock to prevent data races on the pool's LRU, further complicating
the lock handling logic. IMHO, it is just better to collapse all of these
into a single pool-level lock.
Link: https://lkml.kernel.org/r/20221128191616.1261026-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There is a lot of provision for flexibility that isn't actually needed or
used. Zswap (the only zpool user) always passes zpool_ops with an .evict
method set. The backends that reclaim only do so for zswap, so they can
also directly call zpool_ops without indirection or checks.
Finally, there is no need to check the retries parameters and bail with
-EINVAL in the reclaim function, when that's called just a few lines below
with a hard-coded 8. There is no need to duplicate the evictable and
sleep_mapped attrs from the driver in zpool_ops.
Link: https://lkml.kernel.org/r/20221128191616.1261026-3-nphamcs@gmail.com
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Implement writeback for zsmalloc", v7.
Unlike other zswap allocators such as zbud or z3fold, zsmalloc currently
lacks the writeback mechanism. This means that when the zswap pool is
full, it will simply reject further allocations, and the pages will be
written directly to swap.
This series of patches implements writeback for zsmalloc. When the zswap
pool becomes full, zsmalloc will attempt to evict all the compressed
objects in the least-recently used zspages.
This patch (of 6):
zswap's customary lock order is tree->lock before pool->lock, because the
tree->lock protects the entries' refcount, and the free callbacks in the
backends acquire their respective pool locks to dispatch the backing
object. zsmalloc's map callback takes the pool lock, so zswap must not
grab the tree->lock while a handle is mapped. This currently only happens
during writeback, which isn't implemented for zsmalloc. In preparation
for it, move the tree->lock section out of the mapped entry section.
Link: https://lkml.kernel.org/r/20221128191616.1261026-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20221128191616.1261026-2-nphamcs@gmail.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When MADV_PAGEOUT is called on a private file mapping VMA region, we bail
out early if the process is neither owner nor write capable of the file.
However, this VMA may have both private/shared clean pages and private
dirty pages. The opportunity of paging out the private dirty pages (Anon
pages) is missed. Fix this behavior by allowing pageout of private file
mappings to proceed further and performing the file access check along
with PageAnon() during the page walk.
We observe ~10% improvement in zram usage, thus leaving more available
memory on a 4GB RAM system running Android.
[quic_pkondeti@quicinc.com: v2]
Link: https://lkml.kernel.org/r/1669962597-27724-1-git-send-email-quic_pkondeti@quicinc.com
Link: https://lkml.kernel.org/r/1667971116-12900-1-git-send-email-quic_pkondeti@quicinc.com
Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
"mm_khugepaged_collapse_file" for capturing is_shmem.
Currently, is_shmem is not being captured. Capturing is_shmem is useful
as it can indicate if tmpfs is being used as a backing store instead of
persistent storage. Add the tracepoint in collapse_file() named
"mm_khugepaged_collapse_file" for capturing is_shmem.
[gautammenghani201@gmail.com: swap is_shmem and addr to save space, per Steven Rostedt]
Link: https://lkml.kernel.org/r/20221202201807.182829-1-gautammenghani201@gmail.com
Link: https://lkml.kernel.org/r/20221026052218.148234-1-gautammenghani201@gmail.com
Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> [tracing]
Cc: David Hildenbrand <david@redhat.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Fortunately, the last user (KSM) is gone, so let's just remove this rather
special code from generic GUP handling -- especially because KSM never
required the PMD handling as KSM only deals with individual base pages.
[akpm@linux-foundation.org: fix merge snafu]
Link: https://lkml.kernel.org/r/20221021101141.84170-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
FOLL_MIGRATION exists only for the purpose of break_ksm(), and actually
there is not even a need to wait for the migration to finish; we only
want to know whether we're dealing with a KSM page.
Using follow_page() just to identify a KSM page overcomplicates GUP code.
Let's use walk_page_range_vma() instead, because we don't actually care
about the page itself, we only need to know a single property -- no need
to even grab a reference.
So, get rid of follow_page() usage such that we can get rid of
FOLL_MIGRATION now and eventually be able to get rid of follow_page() in
the future.
In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge
performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in
a performance degradation of ~2% (old: ~5010 MiB/s, new: ~4900 MiB/s). I
don't think we particularly care for now.
Interestingly, the benchmark reduction is due to the single callback.
Adding a second callback (e.g., pud_entry()) reduces the benchmark by
another 100-200 MiB/s.
Link: https://lkml.kernel.org/r/20221021101141.84170-9-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's add walk_page_range_vma(), which is similar to walk_page_vma() but
is only interested in a subset of the VMA range.
To be used in KSM code to stop using follow_page() next.
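The assumed shape of the new helper (sketched from the description above):
  int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
                          unsigned long end, const struct mm_walk_ops *ops,
                          void *private);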
Link: https://lkml.kernel.org/r/20221021101141.84170-8-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's stop breaking COW via a fake write fault and let's use
FAULT_FLAG_UNSHARE instead. This avoids any wrong side effects of the
fake write fault, such as mapping the PTE writable and marking the pte
dirty/softdirty.
Consequently, we will no longer trigger a fake write fault and break COW
without any such side-effects.
Also, this fixes KSM interaction with userfaultfd-wp: when we have a KSM
page that's write-protected by userfaultfd, break_ksm()->handle_mm_fault()
will fail with VM_FAULT_SIGBUS and will simply return in break_ksm() with
0 instead of actually breaking COW.
For now, the KSM unmerge tests can trigger that:
$ sudo ./ksm_functional_tests
TAP version 13
1..3
# [RUN] test_unmerge
ok 1 Pages were unmerged
# [RUN] test_unmerge_discarded
ok 2 Pages were unmerged
# [RUN] test_unmerge_uffd_wp
not ok 3 Pages were unmerged
Bail out! 1 out of 3 tests failed
# Planned tests != run tests (2 != 3)
# Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0
The warning in dmesg also indicates this wrong handling:
[ 230.096368] FAULT_FLAG_ALLOW_RETRY missing 881
[ 230.100822] CPU: 1 PID: 1643 Comm: ksm-uffd-wp [...]
[ 230.110124] Hardware name: [...]
[ 230.117775] Call Trace:
[ 230.120227] <TASK>
[ 230.122334] dump_stack_lvl+0x44/0x5c
[ 230.126010] handle_userfault.cold+0x14/0x19
[ 230.130281] ? tlb_finish_mmu+0x65/0x170
[ 230.134207] ? uffd_wp_range+0x65/0xa0
[ 230.137959] ? _raw_spin_unlock+0x15/0x30
[ 230.141972] ? do_wp_page+0x50/0x590
[ 230.145551] __handle_mm_fault+0x9f5/0xf50
[ 230.149652] ? mmput+0x1f/0x40
[ 230.152712] handle_mm_fault+0xb9/0x2a0
[ 230.156550] break_ksm+0x141/0x180
[ 230.159964] unmerge_ksm_pages+0x60/0x90
[ 230.163890] ksm_madvise+0x3c/0xb0
[ 230.167295] do_madvise.part.0+0x10c/0xeb0
[ 230.171396] ? do_syscall_64+0x67/0x80
[ 230.175157] __x64_sys_madvise+0x5a/0x70
[ 230.179082] do_syscall_64+0x58/0x80
[ 230.182661] ? do_syscall_64+0x67/0x80
[ 230.186413] entry_SYSCALL_64_after_hwframe+0x63/0xcd
This is primarily a fix for KSM+userfaultfd-wp; however, the fake write
fault was always questionable. As this fix is not easy to backport and
it's not very critical, let's not cc stable.
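The break_ksm() call-site change presumably boils down to something like the
following (the flag combinations are illustrative, not the exact diff):
  vm_fault_t ret;

  /* before: fake write fault to break COW */
  ret = handle_mm_fault(vma, addr, FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE, NULL);

  /* after: unshare without mapping the PTE writable or dirtying it */
  ret = handle_mm_fault(vma, addr, FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE, NULL);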
Link: https://lkml.kernel.org/r/20221021101141.84170-6-david@redhat.com
Fixes: 529b930b87d9 ("userfaultfd: wp: hook userfault handler to write protection fault")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All users -- GUP and KSM -- are gone, let's just remove it.
Link: https://lkml.kernel.org/r/20221021101141.84170-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Now that GUP no longer requires VM_FAULT_WRITE, break_ksm() is the sole
remaining user of VM_FAULT_WRITE. As we also want to stop triggering a
fake write fault and instead use FAULT_FLAG_UNSHARE -- similar to
GUP-triggered unsharing when taking a R/O pin on a shared anonymous page
(including KSM pages), let's stop relying on VM_FAULT_WRITE.
Let's rework break_ksm() to not rely on the return value of
handle_mm_fault() anymore to figure out whether COW-breaking was
successful. Simply perform another follow_page() lookup to verify the
result.
While this makes break_ksm() slightly less efficient, we can simplify
handle_mm_fault() a little and easily switch to FAULT_FLAG_UNSHARE without
introducing similar KSM-specific behavior for FAULT_FLAG_UNSHARE.
In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge
performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in
a performance degradation of ~4% -- 5% (old: ~5250 MiB/s, new: ~5010
MiB/s).
I don't think that we particularly care about that performance drop when
unmerging. If it ever turns out to be an actual performance issue, we can
think about a better alternative for FAULT_FLAG_UNSHARE -- let's just keep
it simple for now.
Link: https://lkml.kernel.org/r/20221021101141.84170-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's add a test to measure performance of KSM breaking not triggered via
COW, but triggered by disabling KSM on an area filled with KSM pages via
MADV_UNMERGEABLE.
Link: https://lkml.kernel.org/r/20221021101141.84170-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
As Peter points out, the caller passes a single VMA and can just do that
check itself.
And in fact, no existing users rely on test_walk() getting called. So
let's just remove it and make the implementation slightly more efficient.
Link: https://lkml.kernel.org/r/20221021101141.84170-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/ksm: break_ksm() cleanups and fixes", v2.
This series cleans up and fixes break_ksm(). In summary, we no longer use
fake write faults to break COW but instead FAULT_FLAG_UNSHARE. Further,
we move away from using follow_page() --- which we can hopefully remove
completely at some point --- and use the new walk_page_range_vma() instead.
Fortunately, we can get rid of VM_FAULT_WRITE and FOLL_MIGRATION in common
code now.
Extend the existing ksm tests with an unmerge benchmark and some new
unmerge tests.
Also, add a selftest to measure MADV_UNMERGEABLE performance. In my setup
(AMD Ryzen 9 3900X), running the KSM selftest to test unmerge performance
on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in a
performance degradation of ~6% -- 7% (old: ~5250 MiB/s, new: ~4900 MiB/s).
I don't think we particularly care for now, but it's good to be aware of
the implication.
This patch (of 9):
Let's add three unmerge tests (MADV_UNMERGEABLE unmerging all pages in the
range).
test_unmerge(): basic unmerge tests
test_unmerge_discarded(): have some pte_none() entries in the range
test_unmerge_uffd_wp(): protect the merged pages using uffd-wp
ksm_tests.c currently contains a mixture of benchmarks and tests, whereby
each test is carried out by executing the ksm_tests binary with specific
parameters. Let's add a new ksm_functional_tests.c that performs multiple,
smaller functional tests all at once.
Link: https://lkml.kernel.org/r/20221021101141.84170-1-david@redhat.com
Link: https://lkml.kernel.org/r/20221021101141.84170-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Our memory management kernel CI testing at Red Hat uses the VM
selftests and we have run into two problems:
First, our LTP tests overlap with the VM selftests.
We want to avoid unhelpful redundancy in our testing practices.
Second, we have observed the current run_vmtests.sh to report overall
failure/ambiguous results in the case that a machine lacks the necessary
hardware to perform one or more of the tests, e.g. ksm tests that
require more than one NUMA node.
We want to be able to run the vm selftests suitable for particular hardware.
Add the ability to run one or more groups of vm tests via run_vmtests.sh
instead of simply all-or-none in order to solve these problems.
Preserve existing default behavior of running all tests when the script
is invoked with no arguments.
Documentation of test groups is included in the patch as follows:
# ./run_vmtests.sh [ -h || --help ]
usage: ./tools/testing/selftests/vm/run_vmtests.sh [ -h | -t "<categories>"]
-t: specify specific categories of tests to run
-h: display this message
The default behavior is to run all tests.
Alternatively, specific groups of tests can be run by passing a string
to the -t argument containing one or more of the following categories
separated by spaces:
- mmap
tests for mmap(2)
- gup_test
tests for gup using gup_test interface
- userfaultfd
tests for userfaultfd(2)
- compaction
a test for the patch "Allow compaction of unevictable pages"
- mlock
tests for mlock(2)
- mremap
tests for mremap(2)
- hugevm
tests for very large virtual address space
- vmalloc
vmalloc smoke tests
- hmm
hmm smoke tests
- madv_populate
test madvise(2) MADV_POPULATE_{READ,WRITE} options
- memfd_secret
test memfd_secret(2)
- process_mrelease
test process_mrelease(2)
- ksm
ksm tests that do not require >=2 NUMA nodes
- ksm_numa
ksm tests that require >=2 NUMA nodes
- pkey
memory protection key tests
- soft_dirty
test soft dirty page bit semantics
- anon_cow
test anonymous copy-on-write semantics
example: ./run_vmtests.sh -t "hmm mmap ksm"
Link: https://lkml.kernel.org/r/20221018231222.1884715-1-jsavitz@redhat.com
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Cc: Joel Savitz <jsavitz@redhat.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
'led' nodes should have a reference to the LED common.yaml schema. Add it where
missing and drop any duplicate properties.
Acked-by: Lee Jones <lee@kernel.org>
Acked-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20221207204327.2810001-2-robh@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
|
|
The example has 'led-gpio' properties, but that's not documented. As the
'gpio' form is deprecated, add 'led-gpios' to the schema and update the
example.
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Lee Jones <lee@kernel.org>
Link: https://lore.kernel.org/r/20221207204327.2810001-1-robh@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
|
|
When kmalloc() fails to allocate memory in kasprintf() in
find_dup_cset_prop(), fn_1 or fn_2 will be NULL, and strcmp() will cause
a NULL pointer dereference.
Fixes: 2fe0e8769df9 ("of: overlay: check prevents multiple fragments touching same property")
Signed-off-by: ruanjinjie <ruanjinjie@huawei.com>
Link: https://lore.kernel.org/r/20221211023337.592266-1-ruanjinjie@huawei.com
Signed-off-by: Rob Herring <robh@kernel.org>
|
|
Vitor left Synopsys and the email address is now bouncing.
Link: https://lore.kernel.org/r/20221211205539.19353-1-alexandre.belloni@bootlin.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
Because not all I3C drivers have the hot-join feature ready, and
especially because not all I3C devices support the hot-join feature,
exporting a SETDASA method could be useful. With this function, the I3C
controller could perform DAA on I3C devices when users decide to turn
these I3C devices off and on again at run-time.
Tested: This change has been tested by turning an I3C device off and
turning it on again at run-time. The device driver calls the SETDASA
method to perform DAA on the device, and communication between the I3C
controller and the device is set up again correctly.
Signed-off-by: Jack Chen <zenghuchen@google.com>
Link: https://lore.kernel.org/r/20221207205059.3848851-1-zenghuchen@google.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
The present definition of module_i3c_i2c_driver uses only the
1st argument, i.e., struct i3c_driver.
Irrespective of CONFIG_I3C being enabled/disabled,
struct i2c_driver is never passed to module_driver().
Passing struct i2c_driver as the 4th argument works.
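Presumably the fix amounts to forwarding the second macro argument, roughly
(a sketch, not the verified final macro):
  #define module_i3c_i2c_driver(__i3cdrv, __i2cdrv)       \
          module_driver(__i3cdrv,                         \
                        i3c_i2c_driver_register,          \
                        i3c_i2c_driver_unregister,        \
                        __i2cdrv)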
Signed-off-by: Akshay Gupta <Akshay.Gupta@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Link: https://lore.kernel.org/r/20221205105413.937704-1-naveenkrishna.chatradhi@amd.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fix from Joerg Roedel:
- Fix device mask to catch all affected devices in the recently added
quirk for QAT devices in the Intel VT-d driver.
* tag 'iommu-fix-v6.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Fix buggy QAT device mask
|
|
Follow the advice of Documentation/filesystems/sysfs.rst: show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the
value to be returned to user space.
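For reference, the recommended pattern looks like this (the attribute below
is a made-up example, not the driver's actual code):
  static ssize_t foo_show(struct device *dev, struct device_attribute *attr,
                          char *buf)
  {
          return sysfs_emit(buf, "%d\n", foo_value);
  }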
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
"flock" is leaked if an error happens before smb2_lock_init(), as the
lock is not added to the lock_list to be cleaned up.
Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When ksmbd_rpc_open() fails then it must call ksmbd_rpc_id_free() to
undo the result of ksmbd_ipc_id_alloc().
Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
One-element arrays are deprecated, and we are replacing them with flexible
array members instead. So, replace one-element arrays with flexible-array
members in multiple structs in fs/ksmbd/smb_common.h and one in
fs/ksmbd/smb2pdu.h.
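The transformation in question, shown on a made-up struct rather than the
actual ksmbd ones:
  /* before: deprecated one-element array */
  struct example_req {
          __le16 DataLength;
          __u8   Data[1];
  };

  /* after: flexible-array member */
  struct example_req {
          __le16 DataLength;
          __u8   Data[];
  };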
Important to mention is that doing a build before/after this patch results
in no binary output differences.
This helps with the ongoing efforts to tighten the FORTIFY_SOURCE routines
on memcpy() and help us make progress towards globally enabling
-fstrict-flex-arrays=3 [1].
Link: https://github.com/KSPP/linux/issues/242
Link: https://github.com/KSPP/linux/issues/79
Link: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602902.html [1]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
ksmbd seems to be trying to use a cmd value of 0 when unlocking a file.
That activity requires a type of F_UNLCK with a cmd of F_SETLK. For
local POSIX locking, it doesn't matter much since vfs_lock_file ignores
@cmd, but filesystems that define their own ->lock operation expect to
see it set sanely.
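The expected pattern when unlocking is roughly (a sketch of the intent, not
the exact ksmbd diff):
  flock->fl_type = F_UNLCK;
  err = vfs_lock_file(filp, F_SETLK, flock, NULL);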
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Currently, SMB2_SESSION_FLAG_ENCRYPT_DATA is always set in the session setup
response, regardless of whether data encryption is enforced for this share.
Since this forces data encryption from the client, there is a
problem that data is always encrypted regardless of the use of the cifs
seal mount option. SMB2_SESSION_FLAG_ENCRYPT_DATA should be set according
to KSMBD_GLOBAL_FLAG_SMB2_ENCRYPTION flags, and in case of
KSMBD_GLOBAL_FLAG_SMB2_ENCRYPTION_OFF, encryption mode is turned off for
all connections.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
There are 2 ways to create IPoIB PKEY child interfaces:
1) Writing a PKEY to /sys/class/net/<ib parent interface>/create_child.
2) Using netlink with iproute.
While with sysfs the child interface has the same number of tx and
rx queues as the parent, with netlink there will always be 1 tx
and 1 rx queue for the child interface. That's because the
get_num_tx/rx_queues() netlink ops are missing and the default value
of 1 is taken for the number of queues (in rtnl_create_link()).
This change adds the get_num_tx/rx_queues() ops, which allows interfaces
with multiple queues to be created over netlink. The value returned by
these ops only represents the maximum number of tx and rx queues on that
net device.
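Roughly, the missing wiring looks like this (the callback body and the chosen
maximum are illustrative):
  static unsigned int ipoib_get_max_num_queues(void)
  {
          return num_possible_cpus();     /* illustrative upper bound */
  }

  static struct rtnl_link_ops ipoib_link_ops __read_mostly = {
          /* ... */
          .get_num_rx_queues      = ipoib_get_max_num_queues,
          .get_num_tx_queues      = ipoib_get_max_num_queues,
  };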
Fixes: 9baa0b036410 ("IB/ipoib: Add rtnl_link_ops support")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://lore.kernel.org/r/f4a42c8aa43c02d5ae5559a60c3e5e0f18c82531.1670485816.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The netdev core will detect if any untracked puts are done on tracked
pointers and throw refcount warnings:
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 1 PID: 33 at lib/refcount.c:31 refcount_warn_saturate+0x1d7/0x1f0 lib/refcount.c:31
Modules linked in:
CPU: 1 PID: 33 Comm: kworker/u4:2 Not tainted 6.1.0-rc8-next-20221207-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Workqueue: ib-unreg-wq ib_unregister_work
RIP: 0010:refcount_warn_saturate+0x1d7/0x1f0 lib/refcount.c:31
Code: 05 5a 60 51 0a 01 e8 35 0a b5 05 0f 0b e9 d3 fe ff ff e8 6c 9b 75 fd 48 c7 c7 c0 6d a6 8a c6 05 37 60 51 0a 01 e8 16 0a b5 05 <0f> 0b e9 b4 fe
+ff ff 48 89 ef e8 5a b5 c3 fd e9 5c fe ff ff 0f 1f
RSP: 0018:ffffc90000aa7b30 EFLAGS: 00010082
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff8880172f9d40 RSI: ffffffff8166b1dc RDI: fffff52000154f58
RBP: ffff88807906c600 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000080000001 R11: 0000000000000000 R12: 1ffff92000154f6b
R13: 0000000000000000 R14: ffff88807906c600 R15: ffff888046894000
FS: 0000000000000000(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffe350a8ff8 CR3: 000000007a9e7000 CR4: 00000000003526e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__refcount_dec include/linux/refcount.h:344 [inline]
refcount_dec include/linux/refcount.h:359 [inline]
ref_tracker_free+0x539/0x6b0 lib/ref_tracker.c:118
netdev_tracker_free include/linux/netdevice.h:4039 [inline]
netdev_put include/linux/netdevice.h:4056 [inline]
dev_put include/linux/netdevice.h:4082 [inline]
free_netdevs+0x1f8/0x470 drivers/infiniband/core/device.c:2204
__ib_unregister_device+0xa0/0x1a0 drivers/infiniband/core/device.c:1478
ib_unregister_work+0x19/0x30 drivers/infiniband/core/device.c:1586
process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
worker_thread+0x669/0x1090 kernel/workqueue.c:2436
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
So change the missed dev_put for pdata->netdev to also follow the tracker.
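Concretely, the fix presumably swaps the untracked put for the tracked
variant (field names are assumptions):
  -       dev_put(pdata->netdev);
  +       netdev_put(pdata->netdev, &pdata->netdev_tracker);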
Fixes: 09f530f0c6d6 ("RDMA: Add netdevice_tracker to ib_device_set_netdev()")
Reported-by: syzbot+3fd8326d9a0812d19218@syzkaller.appspotmail.com
Reported-by: syzbot+a1ed8ffe3121380cd5dd@syzkaller.appspotmail.com
Reported-by: syzbot+8d0a099c8a6d1e4e601c@syzkaller.appspotmail.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v1-e99919867b8d+1e2-netdev_tracker2_jgg@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Nine hotfixes.
Six for MM, three for other areas. Four of these patches address
post-6.0 issues"
* tag 'mm-hotfixes-stable-2022-12-10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
memcg: fix possible use-after-free in memcg_write_event_control()
MAINTAINERS: update Muchun Song's email
mm/gup: fix gup_pud_range() for dax
mmap: fix do_brk_flags() modifying obviously incorrect VMAs
mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit
tmpfs: fix data loss from failed fallocate
kselftests: cgroup: update kmem test precision tolerance
mm: do not BUG_ON missing brk mapping, because userspace can unmap it
mailmap: update Matti Vaittinen's email address
|
|
Change the run_estimation flag to start/stop the kthread tasks.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Cc: "dust.li" <dust.li@linux.alibaba.com>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Allow the kthreads for stats to be configured for
specific cpulist (isolation) and niceness (scheduling
priority).
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Cc: "dust.li" <dust.li@linux.alibaba.com>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Estimating all entries in a single list in timer context
by a single CPU causes large latency with multiple IPVS rules,
as reported in [1], [2], [3].
Spread the estimator structures in multiple chains and
use kthread(s) for the estimation. The chains are processed
in multiple (50) timer ticks to ensure the 2-second interval
between estimations with some accuracy. Every chain is
processed under RCU lock.
Every kthread works over its own data structure and all
such contexts are attached to array. The contexts can be
preserved while the kthread tasks are stopped or restarted.
When estimators are removed, unused kthread contexts are
released and the slots in array are left empty.
The first kthread determines the parameters to use, e.g. the maximum
number of estimators to process per kthread based on
the chain's length (chain_max), allowing a sub-100us cond_resched
rate and estimation taking up to 1/8 of the CPU capacity
to avoid any problems if chain_max is not correctly
calculated.
chain_max is calculated taking into account factors
such as CPU speed and memory/cache speed where the
cache_factor (4) is selected from real tests with
current generation of CPU/NUMA configurations to
correct the difference in CPU usage between
cached (during calc phase) and non-cached (working) state
of the estimated per-cpu data.
The first kthread also plays the role of distributor of
added estimators to all kthreads, keeping the time to
add estimators low. The optimization is based on
the fact that a newly added estimator should be estimated
after 2 seconds, so we have the time to offload the
adding to chain from controlling process to kthread 0.
The allocated kthread context may grow from 1 to 50
allocated structures for timer ticks which saves memory for
setups with small number of estimators.
We also add delayed work est_reload_work that will
make sure the kthread tasks are properly started/stopped.
ip_vs_start_estimator() is changed to report errors,
which allows the estimators to be stored safely in
allocated structures.
Many thanks to Jiri Wiesner for his valuable comments
and for spending a lot of time reviewing and testing
the changes on different platforms with 48-256 CPUs and
1-8 NUMA nodes under different cpufreq governors.
[1] Report from Yunhong Jiang:
https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@gmail.com/
[2]
https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2
[3] Report from Dust:
https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Cc: "dust.li" <dust.li@linux.alibaba.com>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Tested-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Use the provided u64_stats_t type to avoid
load/store tearing.
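Sketch of the type switch (struct and field names are illustrative, not the
exact IPVS layout):
  struct ip_vs_counters {
          u64_stats_t     conns;          /* was: u64 conns; */
          /* ... */
  };

  /* writer, inside u64_stats_update_begin()/end() */
  u64_stats_inc(&cnt->conns);

  /* reader, inside u64_stats_fetch_begin()/retry() */
  total = u64_stats_read(&cnt->conns);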
Fixes: 316580b69d0a ("u64_stats: provide u64_stats_t type")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Cc: "dust.li" <dust.li@linux.alibaba.com>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Tested-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Move the alloc_percpu/free_percpu logic into new functions.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Cc: "dust.li" <dust.li@linux.alibaba.com>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
In preparation for using RCU locking for the list
of estimators, make sure the struct ip_vs_stats
are released after an RCU grace period by using RCU
callbacks. This affects ipvs->tot_stats, where we
can not use RCU callbacks for ipvs, so we use an
allocated struct ip_vs_stats_rcu. For services
and dests we force RCU callbacks for all cases.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Cc: "dust.li" <dust.li@linux.alibaba.com>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Eduard Zingerman says:
====================
This patch-set consists of a series of bug fixes for register ID
tracking in verifier.c:states_equal()/regsafe() functions:
- for registers of type PTR_TO_MAP_{KEY,VALUE}, PTR_TO_PACKET[_META]
the regsafe() should call check_ids() even if registers are
byte-to-byte equal;
- states_equal() must maintain idmap that covers all function frames
in the state because functions like mark_ptr_or_null_regs() operate
on all registers in the state;
- regsafe() must compare spin lock ids for PTR_TO_MAP_VALUE registers.
The last point covers issue reported by Kumar Kartikeya Dwivedi in [1],
I borrowed the test commit from there.
Note that there is also an issue with register id tracking for
scalars, described in [2]; it will be addressed separately.
[1] https://lore.kernel.org/bpf/20221111202719.982118-1-memxor@gmail.com/
[2] https://lore.kernel.org/bpf/20221128163442.280187-2-eddyz87@gmail.com/
Eduard Zingerman (6):
bpf: regsafe() must not skip check_ids()
selftests/bpf: test cases for regsafe() bug skipping check_id()
bpf: states_equal() must build idmap for all function frames
selftests/bpf: verify states_equal() maintains idmap across all frames
bpf: use check_ids() for active_lock comparison
selftests/bpf: test case for relaxed prunning of active_lock.id
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|