Age | Commit message (Collapse) | Author |
|
A discussion of the overall problem is below.
As mentioned in patch 0001, the steps are to fix the problem are:
1) Provide put_user_page*() routines, intended to be used
for releasing pages that were pinned via get_user_pages*().
2) Convert all of the call sites for get_user_pages*(), to
invoke put_user_page*(), instead of put_page(). This involves dozens of
call sites, and will take some time.
3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.
4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem.
Overview
========
Some kernel components (file systems, device drivers) need to access
memory that is specified via process virtual address. For a long time,
the API to achieve that was get_user_pages ("GUP") and its variations.
However, GUP has critical limitations that have been overlooked; in
particular, GUP does not interact correctly with filesystems in all
situations. That means that file-backed memory + GUP is a recipe for
potential problems, some of which have already occurred in the field.
GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem
code to get the struct page behind a virtual address and to let storage
hardware perform a direct copy to or from that page. This is a
short-lived access pattern, and as such, the window for a concurrent
writeback of GUP'd page was small enough that there were not (we think)
any reported problems. Also, userspace was expected to understand and
accept that Direct IO was not synchronized with memory-mapped access to
that data, nor with any process address space changes such as munmap(),
mremap(), etc.
Over the years, more GUP uses have appeared (virtualization, device
drivers, RDMA) that can keep the pages they get via GUP for a long period
of time (seconds, minutes, hours, days, ...). This long-term pinning
makes an underlying design problem more obvious.
In fact, there are a number of key problems inherent to GUP:
Interactions with file systems
==============================
File systems expect to be able to write back data, both to reclaim pages,
and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
write access to the file memory pages means that such hardware can dirty
the pages, without the filesystem being aware. This can, in some cases
(depending on filesystem, filesystem options, block device, block device
options, and other variables), lead to data corruption, and also to kernel
bugs of the form:
kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
backtrace:
ext4_writepage
__writepage
write_cache_pages
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
...which is due to the file system asserting that there are still buffer
heads attached:
({ \
BUG_ON(!PagePrivate(page)); \
((struct buffer_head *)page_private(page)); \
})
Dave Chinner's description of this is very clear:
"The fundamental issue is that ->page_mkwrite must be called on every
write access to a clean file backed page, not just the first one.
How long the GUP reference lasts is irrelevant, if the page is clean
and you need to dirty it, you must call ->page_mkwrite before it is
marked writeable and dirtied. Every. Time."
This is just one symptom of the larger design problem: real filesystems
that actually write to a backing device, do not actually support
get_user_pages() being called on their pages, and letting hardware write
directly to those pages--even though that pattern has been going on since
about 2005 or so.
Long term GUP
=============
Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
writeable mapping is created), and the pages are file-backed. That can
lead to filesystem corruption. What happens is that when a file-backed
page is being written back, it is first mapped read-only in all of the CPU
page tables; the file system then assumes that nobody can write to the
page, and that the page content is therefore stable. Unfortunately, the
GUP callers generally do not monitor changes to the CPU pages tables; they
instead assume that the following pattern is safe (it's not):
get_user_pages()
Hardware can keep a reference to those pages for a very long time,
and write to it at any time. Because "hardware" here means "devices
that are not a CPU", this activity occurs without any interaction with
the kernel's file system code.
for each page
set_page_dirty
put_page()
In fact, the GUP documentation even recommends that pattern.
Anyway, the file system assumes that the page is stable (nothing is
writing to the page), and that is a problem: stable page content is
necessary for many filesystem actions during writeback, such as checksum,
encryption, RAID striping, etc. Furthermore, filesystem features like COW
(copy on write) or snapshot also rely on being able to use a new page for
as memory for that memory range inside the file.
Corruption during write back is clearly possible here. To solve that, one
idea is to identify pages that have active GUP, so that we can use a
bounce page to write stable data to the filesystem. The filesystem would
work on the bounce page, while any of the active GUP might write to the
original page. This would avoid the stable page violation problem, but
note that it is only part of the overall solution, because other problems
remain.
Other filesystem features that need to replace the page with a new one can
be inhibited for pages that are GUP-pinned. This will, however, alter and
limit some of those filesystem features. The only fix for that would be
to require GUP users to monitor and respond to CPU page table updates.
Subsystems such as ODP and HMM do this, for example. This aspect of the
problem is still under discussion.
Direct IO
=========
Direct IO can cause corruption, if userspace does Direct-IO that writes to
a range of virtual addresses that are mmap'd to a file. The pages written
to are file-backed pages that can be under write back, while the Direct IO
is taking place. Here, Direct IO races with a write back: it calls GUP
before page_mkclean() has replaced the CPU pte with a read-only entry.
The race window is pretty small, which is probably why years have gone by
before we noticed this problem: Direct IO is generally very quick, and
tends to finish up before the filesystem gets around to do anything with
the page contents. However, it's still a real problem. The solution is
to never let GUP return pages that are under write back, but instead,
force GUP to take a write fault on those pages. That way, GUP will
properly synchronize with the active write back. This does not change the
required GUP behavior, it just avoids that race.
Details
=======
Introduces put_user_page(), which simply calls put_page(). This provides
a way to update all get_user_pages*() callers, so that they call
put_user_page(), instead of put_page().
Also introduces put_user_pages(), and a few dirty/locked variations, as a
replacement for release_pages(), and also as a replacement for open-coded
loops that release multiple pages. These may be used for subsequent
performance improvements, via batching of pages to be released.
This is the first step of fixing a problem (also described in [1] and [2])
with interactions between get_user_pages ("gup") and filesystems.
Problem description: let's start with a bug report. Below, is what
happens sometimes, under memory pressure, when a driver pins some pages
via gup, and then marks those pages dirty, and releases them. Note that
the gup documentation actually recommends that pattern. The problem is
that the filesystem may do a writeback while the pages were gup-pinned,
and then the filesystem believes that the pages are clean. So, when the
driver later marks the pages as dirty, that conflicts with the
filesystem's page tracking and results in a BUG(), like this one that I
experienced:
kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
backtrace:
ext4_writepage
__writepage
write_cache_pages
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
...which is due to the file system asserting that there are still buffer
heads attached:
({ \
BUG_ON(!PagePrivate(page)); \
((struct buffer_head *)page_private(page)); \
})
Dave Chinner's description of this is very clear:
"The fundamental issue is that ->page_mkwrite must be called on
every write access to a clean file backed page, not just the first
one. How long the GUP reference lasts is irrelevant, if the page is
clean and you need to dirty it, you must call ->page_mkwrite before it
is marked writeable and dirtied. Every. Time."
This is just one symptom of the larger design problem: real filesystems
that actually write to a backing device, do not actually support
get_user_pages() being called on their pages, and letting hardware write
directly to those pages--even though that pattern has been going on since
about 2005 or so.
The steps are to fix it are:
1) (This patch): provide put_user_page*() routines, intended to be used
for releasing pages that were pinned via get_user_pages*().
2) Convert all of the call sites for get_user_pages*(), to
invoke put_user_page*(), instead of put_page(). This involves dozens of
call sites, and will take some time.
3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.
4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem.
[1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
[2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
Link: http://lkml.kernel.org/r/20190327023632.13307-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> [docs]
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
On systems without CONTIG_ALLOC activated but that support gigantic pages,
boottime reserved gigantic pages can not be freed at all. This patch
simply enables the possibility to hand back those pages to memory
allocator.
Link: http://lkml.kernel.org/r/20190327063626.18421-5-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Acked-by: David S. Miller <davem@davemloft.net> [sparc]
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This condition allows to define alloc_contig_range, so simplify it into a
more accurate naming.
Link: http://lkml.kernel.org/r/20190327063626.18421-4-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Only memcg_numa_stat_show() uses those wrappers and the lru bitmasks,
group them together.
Link: http://lkml.kernel.org/r/20190228163020.24100-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
mem_cgroup_node_nr_lru_pages() is just a convenience wrapper around
lruvec_page_state() that takes bitmasks of lru indexes and aggregates the
counts for those.
Replace callsites where the bitmask is simple enough with direct
lruvec_page_state() calls.
This removes the last extern user of mem_cgroup_node_nr_lru_pages(), so
make that function private again, too.
Link: http://lkml.kernel.org/r/20190228163020.24100-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Instead of adding up the zone counters, use lruvec_page_state() to get the
node state directly. This is a bit cheaper and more stream-lined.
Link: http://lkml.kernel.org/r/20190228163020.24100-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "mm: memcontrol: clean up the LRU counts tracking".
The memcg LRU stats usage is currently a bit messy. Memcg has private
per-zone counters because reclaim needs zone granularity sometimes, but we
also have plenty of users that need to awkwardly sum them up to node or
memcg granularity. Meanwhile the canonical per-memcg vmstats do not track
the LRU counts (NR_INACTIVE_ANON etc.) as you'd expect.
This series enables LRU count tracking in the per-memcg vmstats array such
that lruvec_page_state() and memcg_page_state() work on the enum
node_stat_item items for the LRU counters. Then it converts all the
callers that don't specifically need per-zone numbers over to that.
This patch (of 6):
The memcg code currently maintains private per-zone breakdowns of the LRU
counters. This is necessary for reclaim decisions which are still
zone-based, but there are a variety of users of these counters that only
want the aggregate per-lruvec or per-memcg LRU counts, and they need to
painfully sum up the zone counters on each request for that.
These would be better served using the memcg vmstats arrays, which track
VM statistics at the desired scope already. They just don't have the LRU
counts right now.
So to kick off the conversion, begin tracking LRU counts in those.
Link: http://lkml.kernel.org/r/20190228163020.24100-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The page alloc fast path it may perform node reclaim, which may cause a
latency spike. We should add tracepoint for this event, and also measure
the latency it causes.
So bellow two tracepoints are introduced,
mm_vmscan_node_reclaim_begin
mm_vmscan_node_reclaim_end
Link: http://lkml.kernel.org/r/1551421452-5385-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: <shaoyafang@didiglobal.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
CONFIG_COMPACTION is set
Only mm_compaction_isolate_{free, migrate}pages may be used when
CONFIG_COMPACTION is not set. All others are used only when
CONFIG_COMPACTION is set.
After this change, if CONFIG_COMPACTION is not set, the tracepoints that
only work when CONFIG_COMPACTION is set will not be exposed to userspace.
Without this change, they will always be exposed in debugfs whether
CONFIG_COMPACTION is set or not. This is an improvement.
Link: http://lkml.kernel.org/r/1552440403-11780-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Showing the gfp flag names instead of the gfp_mask makes trace more
convenient.
Link: http://lkml.kernel.org/r/1552527998-13162-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
To facilitate additional options to get_user_pages_fast() change the
singular write parameter to be gup_flags.
This patch does not change any functionality. New functionality will
follow in subsequent patches.
Some of the get_user_pages_fast() call sites were unchanged because they
already passed FOLL_WRITE or 0 for the write parameter.
NOTE: It was suggested to change the ordering of the get_user_pages_fast()
arguments to ensure that callers were converted. This breaks the current
GUP call site convention of having the returned pages be the final
parameter. So the suggestion was rejected.
Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marshall <hubcap@omnibond.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pach series "Add FOLL_LONGTERM to GUP fast and use it".
HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
advantages. These pages can be held for a significant time. But
get_user_pages_fast() does not protect against mapping FS DAX pages.
Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
retains the performance while also adding the FS DAX checks. XDP has also
shown interest in using this functionality.[1]
In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
and remove the specialized get_user_pages_longterm call.
[1] https://lkml.org/lkml/2019/3/19/939
"longterm" is a relative thing and at this point is probably a misnomer.
This is really flagging a pin which is going to be given to hardware and
can't move. I've thought of a couple of alternative names but I think we
have to settle on if we are going to use FL_LAYOUT or something else to
solve the "longterm" problem. Then I think we can change the flag to a
better name.
Secondly, it depends on how often you are registering memory. I have
spoken with some RDMA users who consider MR in the performance path...
For the overall application performance. I don't have the numbers as the
tests for HFI1 were done a long time ago. But there was a significant
advantage. Some of which is probably due to the fact that you don't have
to hold mmap_sem.
Finally, architecturally I think it would be good for everyone to use
*_fast. There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well. Also
to this point others are looking to use *_fast.
As an aside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same. I agree and I think further cleanup
will be coming. But I'm focused on getting the final solution for DAX at
the moment.
This patch (of 7):
This patch starts a series which aims to support FOLL_LONGTERM in
get_user_pages_fast(). Some callers who would like to do a longterm (user
controlled pin) of pages with the fast variant of GUP for performance
purposes.
Rather than have a separate get_user_pages_longterm() call, introduce
FOLL_LONGTERM and change the longterm callers to use it.
This patch does not change any functionality. In the short term
"longterm" or user controlled pins are unsafe for Filesystems and FS DAX
in particular has been blocked. However, callers of get_user_pages_fast()
were not "protected".
FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
requires vmas to determine if DAX is in use.
NOTE: In merging with the CMA changes we opt to change the
get_user_pages() call in check_and_migrate_cma_pages() to a call of
__get_user_pages_locked() on the newly migrated pages. This makes the
code read better in that we are calling __get_user_pages_locked() on the
pages before and after a potential migration.
As a side affect some of the interfaces are cleaned up but this is not the
primary purpose of the series.
In review[1] it was asked:
<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?
A couple of points.
First "longterm" is a relative thing and at this point is probably a
misnomer. This is really flagging a pin which is going to be given to
hardware and can't move. I've thought of a couple of alternative names
but I think we have to settle on if we are going to use FL_LAYOUT or
something else to solve the "longterm" problem. Then I think we can
change the flag to a better name.
Second, It depends on how often you are registering memory. I have spoken
with some RDMA users who consider MR in the performance path... For the
overall application performance. I don't have the numbers as the tests
for HFI1 were done a long time ago. But there was a significant
advantage. Some of which is probably due to the fact that you don't have
to hold mmap_sem.
Finally, architecturally I think it would be good for everyone to use
*_fast. There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well. Also
to this point others are looking to use *_fast.
As an asside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same. I agree and I think further cleanup
will be coming. But I'm focused on getting the final solution for DAX at
the moment.
</quote>
[1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
[ira.weiny@intel.com: v3]
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We know which LRU is not active.
[chris@chrisdown.name: fix build on !CONFIG_MEMCG]
Link: http://lkml.kernel.org/r/20190322150513.GA22021@chrisdown.name
Link: http://lkml.kernel.org/r/155290128498.31489.18250485448913338607.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "mm: Generalize putback functions"]
putback_inactive_pages() and move_active_pages_to_lru() are almost
similar, so this patchset merges them ina single function.
This patch (of 4):
The patch moves the calculation from putback_inactive_pages() to
shrink_inactive_list(). This makes putback_inactive_pages() looking more
similar to move_active_pages_to_lru().
To do that, we account activated pages in reclaim_stat::nr_activate.
Since a page may change its LRU type from anon to file cache inside
shrink_page_list() (see ClearPageSwapBacked()), we have to account pages
for the both types. So, nr_activate becomes an array.
Previously we used nr_activate to account PGACTIVATE events, but now we
account them into pgactivate variable (since they are about number of
pages in general, not about sum of hpage_nr_pages).
Link: http://lkml.kernel.org/r/155290127956.31489.3393586616054413298.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages. This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.
Large parts of this are "inspired" by Kirill's patch
https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/
[willy@infradead.org: fix swapcache pages]
Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org
[kirill@shutemov.name: hugetlb stores pages in page cache differently]
Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1
Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kirill Shutemov <kirill@shutemov.name>
Reviewed-and-tested-by: Song Liu <songliubraving@fb.com>
Tested-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Tested-by: Qian Cai <cai@lca.pw>
Cc: Hugh Dickins <hughd@google.com>
Cc: Song Liu <liu.song.a23@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Userfaultfd can be misued to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available. By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit. While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.
We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.
Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users. When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.
Andrea said:
: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd. In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.
[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
It is not clear how the zone id is useful in kswapd tracepoints and the id
itself is not really easy to process because it depends on the
configuration (available zones). Let's drop the id for now. If somebody
really needs that information then the zone name should be used instead.
[mhocko@suse.com: new changelog]
Link: http://lkml.kernel.org/r/1552451813-10833-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We now use the slab_list list_head instead of the lru list_head. This
comment has become stale.
Remove stale comment from page struct slab_list list_head.
Link: http://lkml.kernel.org/r/20190402230545.2929-8-tobin@kernel.org
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "mm: Use slab_list list_head instead of lru", v5.
Currently the slab allocators (ab)use the struct page 'lru' list_head. We
have a list head for slab allocators to use, 'slab_list'.
During v2 it was noted by Christoph that the SLOB allocator was reaching
into a list_head, this version adds 2 patches to the front of the set to
fix that.
Clean up all three allocators by using the 'slab_list' list_head instead
of overloading the 'lru' list_head.
This patch (of 7):
Currently if we wish to rotate a list until a specific item is at the
front of the list we can call list_move_tail(head, list). Note that the
arguments are the reverse way to the usual use of list_move_tail(list,
head). This is a hack, it depends on the developer knowing how the
list_head operates internally which violates the layer of abstraction
offered by the list_head. Also, it is not intuitive so the next developer
to come along must study list.h in order to fully understand what is meant
by the call, while this is 'good for' the developer it makes reading the
code harder. We should have an function appropriately named that does
this if there are users for it intree.
By grep'ing the tree for list_move_tail() and list_tail() and attempting
to guess the argument order from the names it seems there is only one
place currently in the tree that does this - the slob allocatator.
Add function list_rotate_to_front() to rotate a list until the specified
item is at the front of the list.
Link: http://lkml.kernel.org/r/20190402230545.2929-2-tobin@kernel.org
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
addresses
Starting with c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page
protection by insert_pfn_pmd()") vmf_insert_pfn_pmd() internally calls
pmdp_set_access_flags(). That helper enforces a pmd aligned @address
argument via VM_BUG_ON() assertion.
Update the implementation to take a 'struct vm_fault' argument directly
and apply the address alignment fixup internally to fix crash signatures
like:
kernel BUG at arch/x86/mm/pgtable.c:515!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 51 PID: 43713 Comm: java Tainted: G OE 4.19.35 #1
[..]
RIP: 0010:pmdp_set_access_flags+0x48/0x50
[..]
Call Trace:
vmf_insert_pfn_pmd+0x198/0x350
dax_iomap_fault+0xe82/0x1190
ext4_dax_huge_fault+0x103/0x1f0
? __switch_to_asm+0x40/0x70
__handle_mm_fault+0x3f6/0x1370
? __switch_to_asm+0x34/0x70
? __switch_to_asm+0x40/0x70
handle_mm_fault+0xda/0x200
__do_page_fault+0x249/0x4f0
do_page_fault+0x32/0x110
? page_fault+0x8/0x30
page_fault+0x1e/0x30
Link: http://lkml.kernel.org/r/155741946350.372037.11148198430068238140.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Piotr Balcer <piotr.balcer@intel.com>
Tested-by: Yan Ma <yan.ma@intel.com>
Tested-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Chandan Rajendra <chandan@linux.ibm.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
"Add more caching controls for userspace filesystems to use, as well as
bug fixes and cleanups"
* tag 'fuse-update-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: clean up fuse_alloc_inode
fuse: Add ioctl flag for x32 compat ioctl
fuse: Convert fusectl to use the new mount API
fuse: fix changelog entry for protocol 7.9
fuse: fix changelog entry for protocol 7.12
fuse: document fuse_fsync_in.fsync_flags
fuse: Add FOPEN_STREAM to use stream_open()
fuse: require /dev/fuse reads to have enough buffer capacity
fuse: retrieve: cap requested size to negotiated max_write
fuse: allow filesystems to have precise control over data cache
fuse: convert printk -> pr_*
fuse: honor RLIMIT_FSIZE in fuse_file_fallocate
fuse: fix writepages on 32bit
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"Another round of various bug fixes came in. Damien improved SMR drive
support a bit, and Chao replaced BUG_ON() with reporting errors to
user since we've not hit from users but did hit from crafted images.
We've found a disk layout bug in large_nat_bits feature which supports
very large NAT entries enabled at mkfs. If the feature is enabled, it
will give a notice to run fsck to correct the on-disk layout.
Enhancements:
- reduce memory consumption for SMR drive
- better discard handling for multiple partitions
- tracepoints for f2fs_file_write_iter/f2fs_filemap_fault
- allow to change CP_CHKSUM_OFFSET
- detect wrong layout of large_nat_bitmap feature
- enhance checking valid data indices
Bug fixes:
- Multiple partition support for SMR drive
- deadlock problem in f2fs_balance_fs_bg
- add boundary checks to fix abnormal behaviors on fuzzed images
- inline_xattr space calculations
- replace f2fs_bug_on with errors
In addition, this series contains various memory boundary check and
sanity check of on-disk consistency"
* tag 'f2fs-for-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (40 commits)
f2fs: fix to avoid accessing xattr across the boundary
f2fs: fix to avoid potential race on sbi->unusable_block_count access/update
f2fs: add tracepoint for f2fs_filemap_fault()
f2fs: introduce DATA_GENERIC_ENHANCE
f2fs: fix to handle error in f2fs_disable_checkpoint()
f2fs: remove redundant check in f2fs_file_write_iter()
f2fs: fix to be aware of readonly device in write_checkpoint()
f2fs: fix to skip recovery on readonly device
f2fs: fix to consider multiple device for readonly check
f2fs: relocate chksum_offset for large_nat_bitmap feature
f2fs: allow unfixed f2fs_checkpoint.checksum_offset
f2fs: Replace spaces with tab
f2fs: insert space before the open parenthesis '('
f2fs: allow address pointer number of dnode aligning to specified size
f2fs: introduce f2fs_read_single_page() for cleanup
f2fs: mark is_extension_exist() inline
f2fs: fix to set FI_UPDATE_WRITE correctly
f2fs: fix to avoid panic in f2fs_inplace_write_data()
f2fs: fix to do sanity check on valid block count of segment
f2fs: fix to do sanity check on valid node/block count
...
|
|
Fix typos in enumeration names for nvme status:
s/ACIVATE/ACTIVATE/
s/INSUFFICENT/INSUFFICIENT/
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 MDS mitigations from Thomas Gleixner:
"Microarchitectural Data Sampling (MDS) is a hardware vulnerability
which allows unprivileged speculative access to data which is
available in various CPU internal buffers. This new set of misfeatures
has the following CVEs assigned:
CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling
CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling
CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling
CVE-2019-11091 MDSUM Microarchitectural Data Sampling Uncacheable Memory
MDS attacks target microarchitectural buffers which speculatively
forward data under certain conditions. Disclosure gadgets can expose
this data via cache side channels.
Contrary to other speculation based vulnerabilities the MDS
vulnerability does not allow the attacker to control the memory target
address. As a consequence the attacks are purely sampling based, but
as demonstrated with the TLBleed attack samples can be postprocessed
successfully.
The mitigation is to flush the microarchitectural buffers on return to
user space and before entering a VM. It's bolted on the VERW
instruction and requires a microcode update. As some of the attacks
exploit data structures shared between hyperthreads, full protection
requires to disable hyperthreading. The kernel does not do that by
default to avoid breaking unattended updates.
The mitigation set comes with documentation for administrators and a
deeper technical view"
* 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/speculation/mds: Fix documentation typo
Documentation: Correct the possible MDS sysfs values
x86/mds: Add MDSUM variant to the MDS documentation
x86/speculation/mds: Add 'mitigations=' support for MDS
x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off
x86/speculation/mds: Fix comment
x86/speculation/mds: Add SMT warning message
x86/speculation: Move arch_smt_update() call to after mitigation decisions
x86/speculation/mds: Add mds=full,nosmt cmdline option
Documentation: Add MDS vulnerability documentation
Documentation: Move L1TF to separate directory
x86/speculation/mds: Add mitigation mode VMWERV
x86/speculation/mds: Add sysfs reporting for MDS
x86/speculation/mds: Add mitigation control for MDS
x86/speculation/mds: Conditionally clear CPU buffers on idle entry
x86/kvm/vmx: Add MDS protection when L1D Flush is not active
x86/speculation/mds: Clear CPU buffers on exit to user
x86/speculation/mds: Add mds_clear_cpu_buffers()
x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests
x86/speculation/mds: Add BUG_MSBDS_ONLY
...
|
|
thermal_of_cooling_device_register() and thermal_cooling_device_register()
are typically called from driver probe functions, and
thermal_cooling_device_unregister() is called from remove functions. This
makes both a perfect candidate for device managed functions.
Introduce devm_thermal_of_cooling_device_register(). This function can
also be used to replace thermal_cooling_device_register() by passing a NULL
pointer as device node. The new function requires both struct device *
and struct device_node * as parameters since the struct device_node *
parameter is not always identical to dev->of_node.
Don't introduce a device managed remove function since it is not needed
at this point.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
|
|
Mark completion EQs as shared resources so that they can be used by CQs
with uid != 0.
Fixes: 7efce3691d33 ("IB/mlx5: Add obj create and destroy functionality")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
'VAL' should be protected by the brackets.
v2:
* Squash the fix for Documentation/bpf/btf.rst
Fixes: 69b693f0aefa ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Gary Lin <glin@suse.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Add fwnode support to the lm3630a driver and optionally allow
configuring the label, default brightness level, and maximum brightness
level. The two outputs can be controlled by bank A and B independently
or bank A can control both outputs.
If the platform data was not configured, then the driver defaults to
enabling both banks. This patch changes the default value to disable
both banks before parsing the firmware node so that just a single bank
can be enabled if desired. There are no in-tree users of this driver.
Driver was tested on a LG Nexus 5 (hammerhead) phone.
Signed-off-by: Brian Masney <masneyb@onstation.org>
Reviewed-by: Dan Murphy <dmurphy@ti.com>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
|
|
Support Touchpad MCU as a special of CrOS EC devices. The current
Touchpad MCU is used on Eve Chromebook and used the same protocol as
other CrOS EC devices.
When a MCU has touchpad support (aka EC_FEATURE_TOUCHPAD), it is
instantiated as a special CrOS EC device with device name 'cros_tp'. So
regardless of the probing order between the actual cros_ec and cros_tp,
the userspace and other kernel drivers should not confuse them.
Signed-off-by: Wei-Ning Huang <wnhuang@google.com>
Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
|
|
Support Fingerprint MCU as a special of CrOS EC devices. The current FP
MCU uses the same EC SPI protocol v3 as other CrOS EC devices on a SPI
bus.
When a MCU has fingerprint support (aka EC_FEATURE_FINGERPRINT), it is
instantiated as a special CrOS EC device with device name 'cros_fp'. So
regardless of the probing order between the actual cros_ec and cros_fp,
the userspace and other kernel drivers should not confuse them.
Signed-off-by: Vincent Palatin <vpalatin@chromium.org>
Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
|
|
Update the feature enum for the Chromebook Embedded Controller to the
latest version. Some of these enums are still not used in the kernel but
we might be also interested on have these enums up to date. Userspace
can use them to query the features to the EC via the cros-ec character
device.
While here, also fix a typo in one comment in the enum.
Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
|
|
Add macros to define masks and bits for imx6sx MQS registers
Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
|
|
Mismatch between what is found in the Datasheets for DA9063 and DA9063L
provided by Dialog Semiconductor, and the register names provided in the
MFD registers file. The changes are for the OTP (one-time-programming)
control registers. The two naming errors are OPT instead of OTP, and
COUNT instead of CONT (i.e. control).
Cc: Stable <stable@vger.kernel.org>
Signed-off-by: Steve Twiss <stwiss.opensource@diasemi.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
|
|
Add support in code for the new forms of the host sleep event.
Detects the presence of this version of the command at runtime,
and use whichever form the EC supports. At this time, always
request the default timeout, and only report the failing response
via a WARN_ONCE(). Future versions could accept the sleep parameter
from outside the driver, and return the response information to
usermode or elsewhere.
Signed-off-by: Evan Green <evgreen@chromium.org>
Reviewed-by: Rajat Jain <rajatja@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Acked-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
|
|
Introduce the command and response structures for the second revision
of the host sleep event. These structures are part of a new EC change
that enables detection of failure to enter S0ix. The EC waits a
kernel-specified timeout (or a default amount of time) for the S0_SLP
pin to change, and wakes the system if that change does not occur in
time.
Signed-off-by: Evan Green <evgreen@chromium.org>
Reviewed-by: Rajat Jain <rajatja@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Acked-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
|
|
Adopt the SPDX license identifiers to ease license compliance
management.
Signed-off-by: Tudor Ambarus <tudor.ambarus@microchip.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
|
|
Covnert the headers of the source and include files to SPDX.
And fix some typos in the descriptions ("interrupt" instead of "I2C").
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Acked-by: Steve Twiss <stwiss.opensource@diasemi.com>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
|
|
Integrated Sensor Hub (ISH) is also a MCU running EC
having feature bit EC_FEATURE_ISH. Instantiate it as
a special CrOS EC device with device name 'cros_ish'.
Signed-off-by: Rushikesh S Kadam <rushikesh.s.kadam@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Acked-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Reviewed-by: Gwendal Grignou <gwendal@chromium.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
|
|
'ib-mfd-pinctrl-5.2-2' and 'ib-mfd-regulator-5.2', tag 'ib-mfd-arm-net-5.2' into ibs-for-mfd-merged
Immutable branch between MFD, ARM and Net due for the 5.2 merge window
|
|
Add "nvidia,gpu-throt-level" property to set gpu hw
throttle level.
Signed-off-by: Wei Ni <wni@nvidia.com>
Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
|
|
.dumpit() callback is used for returning same type of data in the loop,
e.g. loop over ports, resources, devices.
However system parameters are general and standalone for whole
subsystem. It means that getting system parameters should be doit
callback.
Fixes: cb7e0e130503 ("RDMA/core: Add interface to read device namespace sharing mode")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
- Cleanup PCI register definitions, typos, etc (Bjorn Helgaas)
- Remove unnecessary use of user-space types in CPER (Bjorn Helgaas)
- Cleanup setup-bus.c comments & whitespace (Nicholas Johnson)
* pci/trivial:
PCI: Cleanup setup-bus.c comments and whitespace
CPER: Remove unnecessary use of user-space types
CPER: Add UEFI spec references
PCI: Fix comment typos
PCI: Cleanup register definition width and whitespace
# Conflicts:
# drivers/pci/pci.c
# drivers/pci/setup-bus.c
|
|
- Add list of legal DMA address ranges to PCI host bridge (Srinath
Mannam)
- Reserve inaccessible DMA ranges so IOMMU doesn't allocate them (Srinath
Mannam)
- Parse iProc DT dma-ranges to learn what PCI devices can reach via DMA
(Srinath Mannam)
* pci/iova-dma-ranges:
PCI: iproc: Add sorted dma ranges resource entries to host bridge
iommu/dma: Reserve IOVA for PCIe inaccessible DMA address
PCI: Add dma_ranges window list
# Conflicts:
# drivers/pci/probe.c
|
|
- Move IRQ register address computation inside macros (Kishon Vijay
Abraham I)
- Separate legacy IRQ and MSI configuration (Kishon Vijay Abraham I)
- Use hwirq, not virq, to get MSI IRQ number offset (Kishon Vijay Abraham
I)
- Squash ks_pcie_handle_msi_irq() into ks_pcie_msi_irq_handler() (Kishon
Vijay Abraham I)
- Add dwc support for platforms with custom MSI controllers (Kishon Vijay
Abraham I)
- Add keystone-specific MSI controller (Kishon Vijay Abraham I)
- Remove dwc host_ops previously used for keystone-specific MSI (Kishon
Vijay Abraham I)
- Skip dwc default MSI init if platform has custom MSI controller (Kishon
Vijay Abraham I)
- Implement .start_link() and .stop_link() for keystone endpoint support
(Kishon Vijay Abraham I)
- Add keystone "reg-names" DT binding (Kishon Vijay Abraham I)
- Squash ks_pcie_dw_host_init() into ks_pcie_add_pcie_port() (Kishon
Vijay Abraham I)
- Get keystone register resources from DT by name, not index (Kishon
Vijay Abraham I)
- Get DT resources in .probe() to prepare for endpoint support (Kishon
Vijay Abraham I)
- Add "ti,syscon-pcie-mode" DT property for PCIe mode configuration
(Kishon Vijay Abraham I)
- Explicitly set keystone to host mode (Kishon Vijay Abraham I)
- Document DT "atu" reg-names requirement for DesignWare core >= 4.80
(Kishon Vijay Abraham I)
- Enable dwc iATU unroll for endpoint mode as well as host mode (Kishon
Vijay Abraham I)
- Add dwc "version" to identify core >= 4.80 for ATU programming (Kishon
Vijay Abraham I)
- Don't build ARM32-specific keystone code on ARM64 (Kishon Vijay Abraham
I)
- Add DT binding for keystone PCIe RC in AM654 SoC (Kishon Vijay Abraham
I)
- Add keystone support for AM654 SoC PCIe RC (Kishon Vijay Abraham I)
- Reset keystone PHYs before enabling them (Kishon Vijay Abraham I)
- Make of_pci_get_max_link_speed() available to endpoint drivers as well
as host drivers (Kishon Vijay Abraham I)
- Add keystone support for DT "max-link-speed" property (Kishon Vijay
Abraham I)
- Add endpoint library support for BAR buffer alignment (Kishon Vijay
Abraham I)
- Make all dw_pcie_ep_ops structs const (Kishon Vijay Abraham I)
- Fix fencepost error in dw_pcie_ep_find_capability() (Kishon Vijay
Abraham I)
- Add dwc hooks for dbi/dbi2 that share the same address space (Kishon
Vijay Abraham I)
- Add keystone support for TI AM654x in endpoint mode (Kishon Vijay
Abraham I)
- Configure designware endpoints to advertise smallest resizable BAR
(1MB) (Kishon Vijay Abraham I)
- Align designware endpoint ATU windows for raising MSIs (Kishon Vijay
Abraham I)
- Add endpoint test support for TI AM654x (Kishon Vijay Abraham I)
- Fix endpoint test test_reg_bar issue (Kishon Vijay Abraham I)
* remotes/lorenzo/pci/keystone:
misc: pci_endpoint_test: Fix test_reg_bar to be updated in pci_endpoint_test
misc: pci_endpoint_test: Add support to test PCI EP in AM654x
PCI: designware-ep: Use aligned ATU window for raising MSI interrupts
PCI: designware-ep: Configure Resizable BAR cap to advertise the smallest size
PCI: keystone: Add support for PCIe EP in AM654x Platforms
dt-bindings: PCI: Add PCI EP DT binding documentation for AM654
PCI: dwc: Add callbacks for accessing dbi2 address space
PCI: dwc: Fix dw_pcie_ep_find_capability() to return correct capability offset
PCI: dwc: Add const qualifier to struct dw_pcie_ep_ops
PCI: endpoint: Add support to specify alignment for buffers allocated to BARs
PCI: keystone: Add support to set the max link speed from DT
PCI: OF: Allow of_pci_get_max_link_speed() to be used by PCI Endpoint drivers
PCI: keystone: Invoke phy_reset() API before enabling PHY
PCI: keystone: Add support for PCIe RC in AM654x Platforms
dt-bindings: PCI: Add PCI RC DT binding documentation for AM654
PCI: keystone: Prevent ARM32 specific code to be compiled for ARM64
PCI: dwc: Fix ATU identification for designware version >= 4.80
PCI: dwc: Enable iATU unroll for endpoint too
dt-bindings: PCI: Document "atu" reg-names
PCI: keystone: Explicitly set the PCIe mode
dt-bindings: PCI: Add dt-binding to configure PCIe mode
PCI: keystone: Move resources initialization to prepare for EP support
PCI: keystone: Use platform_get_resource_byname() to get memory resources
PCI: keystone: Perform host initialization in a single function
dt-bindings: PCI: keystone: Add "reg-names" binding information
PCI: keystone: Cleanup error_irq configuration
PCI: keystone: Add start_link()/stop_link() dw_pcie_ops
PCI: dwc: Remove default MSI initialization for platform specific MSI chips
PCI: dwc: Remove Keystone specific dw_pcie_host_ops
PCI: keystone: Use Keystone specific msi_irq_chip
PCI: dwc: Add support to use non default msi_irq_chip
PCI: keystone: Cleanup ks_pcie_msi_irq_handler()
PCI: keystone: Use hwirq to get the MSI IRQ number offset
PCI: keystone: Add separate functions for configuring MSI and legacy interrupt
PCI: keystone: Cleanup interrupt related macros
# Conflicts:
# drivers/pci/controller/dwc/pcie-designware.h
|
|
- Add Amazon Annapurna Labs PCIe host controller driver (Jonathan
Chocron)
* pci/host/al:
PCI: al: Add Amazon Annapurna Labs PCIe host controller driver
|
|
- Support all 255 PFF ports in switchtec driver (Wesley Sheng)
- Fix unintentional switchtec MRPC event masking that degraded firmware
update speed (Wesley Sheng)
* pci/switchtec:
switchtec: Fix unintended mask of MRPC event
switchtec: Increase PFF limit from 48 to 255
|
|
- Mark expected switch fall-throughs (Gustavo A. R. Silva)
- Remove unused pci_request_region_exclusive() (Johannes Thumshirn)
- Fix x86 PCI IRQ routing table memory leak (Wenwen Wang)
- Reset Lenovo ThinkPad P50 if firmware didn't do it on reboot (Lyude
Paul)
- Add and use pci_dev_id() helper to simplify PCI_DEVID() usage (touches
several places outside drivers/pci/) (Heiner Kallweit)
- Transition Mobiveil PCI maintenance to Karthikeyan M and Hou Zhiqiang
(Subrahmanya Lingappa)
* pci/misc:
MAINTAINERS: Add Karthikeyan Mitran and Hou Zhiqiang for Mobiveil PCI
platform/chrome: chromeos_laptop: use pci_dev_id() helper
stmmac: pci: Use pci_dev_id() helper
iommu/vt-d: Use pci_dev_id() helper
iommu/amd: Use pci_dev_id() helper
drm/amdkfd: Use pci_dev_id() helper
powerpc/powernv/npu: Use pci_dev_id() helper
r8169: use pci_dev_id() helper
PCI: Add pci_dev_id() helper
PCI: Reset Lenovo ThinkPad P50 nvgpu at boot if necessary
x86/PCI: Fix PCI IRQ routing table memory leak
PCI: Remove unused pci_request_region_exclusive()
PCI: Mark expected switch fall-throughs
|
|
- Remove unused mask_msi_irq(), unmask_msi_irq(), write_msi_msg(),
__write_msi_msg() (Bjorn Helgaas)
* pci/msi:
PCI/MSI: Remove unused mask_msi_irq() and unmask_msi_irq()
PCI/MSI: Remove unused __write_msi_msg() and write_msi_msg()
|
|
- Fix RPA and RPA DLPAR refcount issues (Tyrel Datwyler)
- Stop exporting pci_get_hp_params() (Alexandru Gagniuc)
- Simplify _HPP, _HPX parsing (Alexandru Gagniuc)
- Add support for _HPX Type 3 settings (Alexandru Gagniuc)
- Tell firmware we support _HPX Type 3 via _OSC (Alexandru Gagniuc)
* pci/hotplug:
PCI/ACPI: Advertise _HPX Type 3 support via _OSC
PCI/ACPI: Implement _HPX Type 3 Setting Record
PCI/ACPI: Remove the need for 'struct hotplug_params'
PCI/ACPI: Do not export pci_get_hp_params()
PCI: rpaphp: Get/put device node reference during slot alloc/dealloc
PCI: rpadlpar: Fix leaked device_node references in add/remove paths
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:
- scan hint update which helps address performance issues with heavily
fragmented blocks
- lockdep fix when freeing an allocation causes balance work to be
scheduled
* 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu: remove spurious lock dependency between percpu and sched
percpu: use chunk scan_hint to skip some scanning
percpu: convert chunk hints to be based on pcpu_block_md
percpu: make pcpu_block_md generic
percpu: use block scan_hint to only scan forward
percpu: remember largest area skipped during allocation
percpu: add block level scan_hint
percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE
percpu: relegate chunks unusable when failing small allocations
percpu: manage chunks based on contig_bits instead of free_bytes
percpu: introduce helper to determine if two regions overlap
percpu: do not search past bitmap when allocating an area
percpu: update free path with correct new free region
|