Age | Commit message (Collapse) | Author |
|
Currently ZONE_DEVICE page reference counts are initialised by core memory
management code in __init_zone_device_page() as part of the memremap()
call which driver modules make to obtain ZONE_DEVICE pages. This
initialises page refcounts to 1 before returning them to the driver.
This was presumably done because it drivers had a reference of sorts on
the page. It also ensured the page could always be mapped with
vm_insert_page() for example and would never get freed (ie. have a zero
refcount), freeing drivers of manipulating page reference counts.
However it complicates figuring out whether or not a page is free from the
mm perspective because it is no longer possible to just look at the
refcount. Instead the page type must be known and if GUP is used a
secondary pgmap reference is also sometimes needed.
To simplify this it is desirable to remove the page reference count for
the driver, so core mm can just use the refcount without having to account
for page type or do other types of tracking. This is possible because
drivers can always assume the page is valid as core kernel will never
offline or remove the struct page.
This means it is now up to drivers to initialise the page refcount as
required. P2PDMA uses vm_insert_page() to map the page, and that requires
a non-zero reference count when initialising the page so set that when the
page is first mapped.
Link: https://lkml.kernel.org/r/6aedb0ac2886dcc4503cb705273db5b3863a0b66.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
PCI P2PDMA pages are not mapped with pXX_devmap PTEs therefore the check
in __gup_device_huge() is redundant. Remove it
Link: https://lkml.kernel.org/r/260e3dcfaf05ff1c734a49698ed4332b5dae04c2.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Wiliams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The page ->mapping pointer can have magic values like
PAGE_MAPPING_DAX_SHARED and PAGE_MAPPING_ANON for page owner specific
usage. Currently PAGE_MAPPING_DAX_SHARED and PAGE_MAPPING_ANON alias to
the same value. This isn't a problem because FS DAX pages are never seen
by the anonymous mapping code and vice versa.
However a future change will make FS DAX pages more like normal pages, so
folio_test_anon() must not return true for a FS DAX page.
We could explicitly test for a FS DAX page in folio_test_anon(), etc.
however the PAGE_MAPPING_DAX_SHARED flag isn't actually needed. Instead
we can use the page->mapping field to implicitly track the first mapping
of a page. If page->mapping is non-NULL it implies the page is associated
with a single mapping at page->index. If the page is associated with a
second mapping clear page->mapping and set page->share to 1.
This is possible because a shared mapping implies the file-system
implements dax_holder_operations which makes the ->mapping and ->index,
which is a union with ->share, unused.
The page is considered shared when page->mapping == NULL and page->share >
0 or page->mapping != NULL, implying it is present in at least one address
space. This also makes it easier for a future change to detect when a
page is first mapped into an address space which requires special
handling.
Link: https://lkml.kernel.org/r/c22f699202db0acee2f7039eb026e68261ce42d6.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: Dan Wiliams <dan.j.williams@intel.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
File systems call dax_break_mapping() prior to reallocating file system
blocks to ensure the page is not undergoing any DMA or other accesses.
Generally this is needed when a file is truncated to ensure that if a
block is reallocated nothing is writing to it. However filesystems
currently don't call this when an FS DAX inode is evicted.
This can cause problems when the file system is unmounted as a page can
continue to be under going DMA or other remote access after unmount. This
means if the file system is remounted any truncate or other operation
which requires the underlying file system block to be freed will not wait
for the remote access to complete. Therefore a busy block may be
reallocated to a new file leading to corruption.
Link: https://lkml.kernel.org/r/2d3cf575bbd095084993154be2f0aa7442e5cd28.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: Dan Wiliams <dan.j.williams@intel.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Prior to any truncation operations file systems call dax_break_mapping()
to ensure pages in the range are not under going DMA. Later DAX
page-cache entries will be removed by truncate_folio_batch_exceptionals()
in the generic page-cache code.
However this makes it possible for folios to be removed from the
page-cache even though they are still DMA busy if the file-system hasn't
called dax_break_mapping(). It also means they can never be waited on in
future because FS DAX will lose track of them once the page-cache entry
has been deleted.
Instead it is better to delete the FS DAX entry when the file-system calls
dax_break_mapping() as part of it's truncate operation. This ensures only
idle pages can be removed from the FS DAX page-cache and makes it easy to
detect if a file-system hasn't called dax_break_mapping() prior to a
truncate operation.
Link: https://lkml.kernel.org/r/3be6115eaaa8d28fee37fcba3287be4f226a7d24.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Prior to freeing a block file systems supporting FS DAX must check that
the associated pages are both unmapped from user-space and not undergoing
DMA or other access from eg. get_user_pages(). This is achieved by
unmapping the file range and scanning the FS DAX page-cache to see if any
pages within the mapping have an elevated refcount.
This is done using two functions - dax_layout_busy_page_range() which
returns a page to wait for the refcount to become idle on. Rather than
open-code this introduce a common implementation to both unmap and wait
for the page to become idle.
Link: https://lkml.kernel.org/r/c4d381e41fc618296cee2820403c166d80599d5c.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A FS DAX page is considered idle when its refcount drops to one. This is
currently open-coded in all file systems supporting FS DAX. Move the idle
detection to a common function to make future changes easier.
Link: https://lkml.kernel.org/r/c2c9d269110b90224eeb1dc661ffbc1d82aa20c9.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Several functions internal to FS DAX use the following pattern when trying
to obtain an unlocked entry:
xas_for_each(&xas, entry, end_idx) {
if (dax_is_locked(entry))
entry = get_unlocked_entry(&xas, 0);
This is problematic because get_unlocked_entry() will get the next present
entry in the range, and the next entry may not be locked. Therefore any
processing of the original locked entry will be skipped. This can cause
dax_layout_busy_page_range() to miss DMA-busy pages in the range, leading
file systems to free blocks whilst DMA operations are ongoing which can
lead to file system corruption.
Instead callers from within a xas_for_each() loop should be waiting for
the current entry to be unlocked without advancing the XArray state so a
new function is introduced to wait.
Also while we are here rename get_unlocked_entry() to
get_next_unlocked_entry() to make it clear that it may advance the
iterator state.
Link: https://lkml.kernel.org/r/b11b2baed7157dc900bf07a4571bf71b7cd82d97.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
dax_layout_busy_page_range() is used by file systems to scan the DAX
page-cache to unmap mapping pages from user-space and to determine if any
pages in the given range are busy, either due to ongoing DMA or other
get_user_pages() usage.
Currently it checks to see the file mapping is mapped into user-space with
mapping_mapped() and returns early if not, skipping the check for DMA busy
pages. This is wrong as pages may still be undergoing DMA access even if
they have subsequently been unmapped from user-space. Fix this by
dropping the check for mapping_mapped().
Link: https://lkml.kernel.org/r/d85ce6c2d1400ff111ed7302d9eef223d0243c57.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Balbir Singh <balbirs@nvidia.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "fs/dax: Fix ZONE_DEVICE page reference counts", v9.
Device and FS DAX pages have always maintained their own page reference
counts without following the normal rules for page reference counting. In
particular pages are considered free when the refcount hits one rather
than zero and refcounts are not added when mapping the page.
Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
mechanism for allowing GUP to hold references on the page (see
get_dev_pagemap). However there doesn't seem to be any reason why FS DAX
pages need their own reference counting scheme.
By treating the refcounts on these pages the same way as normal pages we
can remove a lot of special checks. In particular pXd_trans_huge()
becomes the same as pXd_leaf(), although I haven't made that change here.
It also frees up a valuable SW define PTE bit on architectures that have
devmap PTE bits defined.
It also almost certainly allows further clean-up of the devmap managed
functions, but I have left that as a future improvment. It also enables
support for compound ZONE_DEVICE pages which is one of my primary
motivators for doing this work.
This patch (of 20):
FS DAX requires file systems to call into the DAX layout prior to
unlinking inodes to ensure there is no ongoing DMA or other remote access
to the direct mapped page. The fuse file system implements
fuse_dax_break_layouts() to do this which includes a comment indicating
that passing dmap_end == 0 leads to unmapping of the whole file.
However this is not true - passing dmap_end == 0 will not unmap anything
before dmap_start, and further more dax_layout_busy_page_range() will not
scan any of the range to see if there maybe ongoing DMA access to the
range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts()
which will invalidate the entire file range to
dax_layout_busy_page_range().
Link: https://lkml.kernel.org/r/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com
Link: https://lkml.kernel.org/r/f09a34b6c40032022e4ddee6fadb7cc676f08867.1740713401.git-series.apopple@nvidia.com
Fixes: 6ae330cad6ef ("virtiofs: serialize truncate/punch_hole and dax fault path")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Balbir Singh <balbirs@nvidia.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Convert from pfn to folio instead of page and use those folios throughout
to avoid accesses to page->index and page->mapping.
Link: https://lkml.kernel.org/r/20241216155408.8102-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Cc: Dan Willaims <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This looks like a complete mess (why are we setting page->index at page
fault time?), but I no longer care about DAX, and there's no reason to let
DAX hold us back from removing page->index.
Link: https://lkml.kernel.org/r/20241216155408.8102-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
commit 79add3a3f795e ("ext4: notify when discard is not supported")
noted that keeping the DISCARD flag is for possibility that the underlying
device might change in future even without file system remount. However,
this scenario has rarely occurred in practice on the device side. Even if
it does occur, it can be resolved with remount. Clearing the DISCARD flag
not only prevents confusion caused by mount options but also avoids
sending unnecessary discard commands.
Signed-off-by: Diangang Li <lidiangang@bytedance.com>
Link: https://patch.msgid.link/20250311021310.669524-1-lidiangang@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since the function jbd2_journal_unfile_buffer() is no longer called
anywhere after commit e5a120aeb57f ("jbd2: remove journal_head from
descriptor buffers"), so let's remove it.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250306063240.157884-1-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
capable() calls refer to enabled LSMs whether to permit or deny the
request. This is relevant in connection with SELinux, where a
capability check results in a policy decision and by default a denial
message on insufficient permission is issued.
It can lead to three undesired cases:
1. A denial message is generated, even in case the operation was an
unprivileged one and thus the syscall succeeded, creating noise.
2. To avoid the noise from 1. the policy writer adds a rule to ignore
those denial messages, hiding future syscalls, where the task
performs an actual privileged operation, leading to hidden limited
functionality of that task.
3. To avoid the noise from 1. the policy writer adds a rule to permit
the task the requested capability, while it does not need it,
violating the principle of least privilege.
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250302160657.127253-2-cgoettsche@seltendoof.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Commit 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") introduces
the sysfs control interface "mb_max_linear_groups" to address the problem
that rotational devices performance degrades when the "mb_optimize_scan"
feature is enabled, which may result in distant block group allocation.
However, the name of the interface was incorrect in the comment to the
ext4/mballoc.c file, and this patch fixes it, without further changes.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250224012005.689549-1-wozizhi@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In __jbd2_journal_erase(), the block_stop parameter includes the last
block of a contiguous region; however, the calculation of byte_stop is
incorrect, as it does not account for the bytes in that last block.
Consequently, the page cache is not cleared properly, which occasionally
causes the ext4/050 test to fail.
Since block_stop operates on inclusion semantics, it involves repeated
increments and decrements by 1, significantly increasing the complexity
of the calculations. Optimize the calculation and fix the incorrect
byte_stop by make both block_stop and byte_stop to use exclusion
semantics.
This fixes a failure in fstests ext4/050.
Fixes: 01d5d96542fd ("ext4: add discard/zeroout flags to journal flush")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250217065955.3829229-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Buffer heads are attached to folios, not to pages. Also
flush_dcache_page() is now deprecated in favour of flush_dcache_folio().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250213182303.2133205-1-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Otherwise, if ext4_inode_attach_jinode() fails, a hung task will
happen because filemap_invalidate_unlock() isn't called to unlock
mapping->invalidate_lock. Like this:
EXT4-fs error (device sda) in ext4_setattr:5557: Out of memory
INFO: task fsstress:374 blocked for more than 122 seconds.
Not tainted 6.14.0-rc1-next-20250206-xfstests-dirty #726
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:fsstress state:D stack:0 pid:374 tgid:374 ppid:373
task_flags:0x440140 flags:0x00000000
Call Trace:
<TASK>
__schedule+0x2c9/0x7f0
schedule+0x27/0xa0
schedule_preempt_disabled+0x15/0x30
rwsem_down_read_slowpath+0x278/0x4c0
down_read+0x59/0xb0
page_cache_ra_unbounded+0x65/0x1b0
filemap_get_pages+0x124/0x3e0
filemap_read+0x114/0x3d0
vfs_read+0x297/0x360
ksys_read+0x6c/0xe0
do_syscall_64+0x4b/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes: c7fc0366c656 ("ext4: partial zero eof block on unaligned inode size extension")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Link: https://patch.msgid.link/20250213112247.3168709-1-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
There's issue as follows:
BUG: KASAN: use-after-free in ext4_xattr_inode_dec_ref_all+0x6ff/0x790
Read of size 4 at addr ffff88807b003000 by task syz-executor.0/15172
CPU: 3 PID: 15172 Comm: syz-executor.0
Call Trace:
__dump_stack lib/dump_stack.c:82 [inline]
dump_stack+0xbe/0xfd lib/dump_stack.c:123
print_address_description.constprop.0+0x1e/0x280 mm/kasan/report.c:400
__kasan_report.cold+0x6c/0x84 mm/kasan/report.c:560
kasan_report+0x3a/0x50 mm/kasan/report.c:585
ext4_xattr_inode_dec_ref_all+0x6ff/0x790 fs/ext4/xattr.c:1137
ext4_xattr_delete_inode+0x4c7/0xda0 fs/ext4/xattr.c:2896
ext4_evict_inode+0xb3b/0x1670 fs/ext4/inode.c:323
evict+0x39f/0x880 fs/inode.c:622
iput_final fs/inode.c:1746 [inline]
iput fs/inode.c:1772 [inline]
iput+0x525/0x6c0 fs/inode.c:1758
ext4_orphan_cleanup fs/ext4/super.c:3298 [inline]
ext4_fill_super+0x8c57/0xba40 fs/ext4/super.c:5300
mount_bdev+0x355/0x410 fs/super.c:1446
legacy_get_tree+0xfe/0x220 fs/fs_context.c:611
vfs_get_tree+0x8d/0x2f0 fs/super.c:1576
do_new_mount fs/namespace.c:2983 [inline]
path_mount+0x119a/0x1ad0 fs/namespace.c:3316
do_mount+0xfc/0x110 fs/namespace.c:3329
__do_sys_mount fs/namespace.c:3540 [inline]
__se_sys_mount+0x219/0x2e0 fs/namespace.c:3514
do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x67/0xd1
Memory state around the buggy address:
ffff88807b002f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff88807b002f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88807b003000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88807b003080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88807b003100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Above issue happens as ext4_xattr_delete_inode() isn't check xattr
is valid if xattr is in inode.
To solve above issue call xattr_check_inode() check if xattr if valid
in inode. In fact, we can directly verify in ext4_iget_extra_inode(),
so that there is no divergent verification.
Fixes: e50e5129f384 ("ext4: xattr-in-inode support")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250208063141.1539283-3-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Introduce ITAIL helper to get the bound of xattr in inode.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250208063141.1539283-2-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In the days when SCSI-2 was emerging, some drives did claim SCSI-2 but did
not correctly implement it. The st driver first tries MODE SELECT with the
page format bit set to set the block descriptor. If not successful, the
non-page format is tried.
The test only tests the sense code and this triggers also from illegal
parameter in the parameter list. The test is limited to "old" devices and
made more strict to remove false alarms.
Signed-off-by: Kai Mäkisara <Kai.Makisara@kolumbus.fi>
Link: https://lore.kernel.org/r/20250311112516.5548-4-Kai.Makisara@kolumbus.fi
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
The SCSI ERASE command erases from the current position onwards. Don't
clear the position variables.
Signed-off-by: Kai Mäkisara <Kai.Makisara@kolumbus.fi>
Link: https://lore.kernel.org/r/20250311112516.5548-3-Kai.Makisara@kolumbus.fi
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Change the array size to follow parms size instead of a fixed value.
Reported-by: Chenyuan Yang <chenyuan0y@gmail.com>
Closes: https://lore.kernel.org/linux-scsi/CALGdzuoubbra4xKOJcsyThdk5Y1BrAmZs==wbqjbkAgmKS39Aw@mail.gmail.com/
Signed-off-by: Kai Mäkisara <Kai.Makisara@kolumbus.fi>
Link: https://lore.kernel.org/r/20250311112516.5548-2-Kai.Makisara@kolumbus.fi
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
When the tcm_loop_nr_hw_queues is set to a value greater than 1, the
tags of requests in the block layer are no longer unique. This may lead
to erroneous aborting of commands with the same tag. The issue can be
resolved by using blk_mq_unique_tag to generate globally unique
identifiers by combining the hardware queue index and per-queue tags.
Fixes: 6375f8908255 ("tcm_loop: Fixup tag handling")
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250313014728.105849-1-kanie@linux.alibaba.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Commit 32566a6f1ae5 ("scsi: lpfc: Remove NLP_RELEASE_RPI flag from nodelist
structure") introduced a regression with SLI-3 adapters (e.g. LPe12000 8Gb)
where a Link Down / Link Up such as caused by disabling an host FC switch
port would result in the devices remaining in the transport-offline state
and multipath reporting them as failed. This problem was not seen with
newer SLI-4 adapters.
The problem was caused by portions of the patch which removed the functions
__lpfc_sli_rpi_release() and lpfc_sli_rpi_release() and all their callers.
This was presumably because with the removal of the NLP_RELEASE_RPI flag
there was no need to free the rpi.
However, __lpfc_sli_rpi_release() and lpfc_sli_rpi_release() which calls it
reset the NLP_UNREG_INP flag. And, lpfc_sli_def_mbox_cmpl() has a path
where __lpfc_sli_rpi_release() was called in a particular case where
NLP_UNREG_INP was not otherwise cleared because of other conditions.
Restoring the else clause of this conditional and simply clearing the
NLP_UNREG_INP flag appears to resolve the problem with SLI-3 adapters. It
should be noted that the code path in question is not specific to SLI-3,
but there are other SLI-4 code paths which may have masked the issue.
Fixes: 32566a6f1ae5 ("scsi: lpfc: Remove NLP_RELEASE_RPI flag from nodelist structure")
Cc: stable@vger.kernel.org
Tested-by: Marco Patalano <mpatalan@redhat.com>
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Link: https://lore.kernel.org/r/20250317163731.356873-1-emilne@redhat.com
Reviewed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
At present, we determine the protocol through the cmd type, but other cmd
types, such as vendor-specific commands, default to the PIO protocol. This
strategy often causes the execution of different vendor-specific commands
to fail. In fact, for these commands, a better way is to use the protocol
configured by the command's tf to determine its protocol.
Fixes: 6f2ff1a1311e ("hisi_sas: add v2 path to send ATA command")
Signed-off-by: Xingui Yang <yangxingui@huawei.com>
Link: https://lore.kernel.org/r/20250220090011.313848-1-liyihang9@huawei.com
Reviewed-by: Yihang Li <liyihang9@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
strscpy_pad() already NUL-terminates 'data' at the corresponding
indexes. Remove any unnecessary NUL-terminations.
No functional changes intended.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://lore.kernel.org/r/20250314221626.43174-2-thorsten.blum@linux.dev
Reviewed-by: Karan Tilak Kumar <kartilak@cisco.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
destroy_workqueue() already drains the queue before destroying it, so
there is no need to flush it explicitly.
Remove the redundant flush_workqueue() calls.
This was generated with coccinelle:
@@
expression E;
@@
- flush_workqueue(E);
destroy_workqueue(E);
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Link: https://lore.kernel.org/r/20250312074320.1430175-1-nichen@iscas.ac.cn
Reviewed-by: Karan Tilak Kumar <kartilak@cisco.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
During a checkpoint, the current active segment X may not be handled
properly. This occurs when segment X has 0 valid blocks and a non-zero
number of discard blocks, for the following reasons:
locate_dirty_segment() does not mark any active segment as a prefree
segment. As a result, segment X is not included in dirty_segmap[PRE], and
f2fs_clear_prefree_segments() skips it when handling prefree segments.
add_discard_addrs() skips any segment with 0 valid blocks, so segment X is
also skipped.
Consequently, no `struct discard_cmd` is actually created for segment X.
However, the ckpt_valid_map and cur_valid_map of segment X are synced by
seg_info_to_raw_sit() during the current checkpoint process. As a result,
it cannot find the missing discard bits even in subsequent checkpoints.
Consequently, the value of sbi->discard_blks remains non-zero. Thus, when
f2fs is umounted, CP_TRIMMED_FLAG will not be set due to the non-zero
sbi->discard_blks.
Relevant code process:
f2fs_write_checkpoint()
f2fs_flush_sit_entries()
list_for_each_entry_safe(ses, tmp, head, set_list) {
for_each_set_bit_from(segno, bitmap, end) {
...
add_discard_addrs(sbi, cpc, false); // skip segment X due to its 0 valid blocks
...
seg_info_to_raw_sit(); // sync ckpt_valid_map with cur_valid_map for segment X
...
}
}
f2fs_clear_prefree_segments(); // segment X is not included in dirty_segmap[PRE] and is skipped
This issue is easy to reproduce with the following operations:
root # mkfs.f2fs -f /dev/f2fs_dev
root # mount -t f2fs /dev/f2fs_dev /mnt_point
root # dd if=/dev/blk_dev of=/mnt_point/1.bin bs=4k count=256
root # sync
root # rm /mnt_point/1.bin
root # umount /mnt_point
root # dump.f2fs /dev/f2fs_dev | grep "checkpoint state"
Info: checkpoint state = 45 : crc compacted_summary unmount ---- 'trimmed' flag is missing
Since add_discard_addrs() can handle active segments with non-zero valid
blocks, it is reasonable to fix this issue by allowing it to also handle
active segments with 0 valid blocks.
Fixes: b29555505d81 ("f2fs: add key functions for small discards")
Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
this is unnecessary when we know we are overwriting already allocated
blocks and the overhead of starting a transaction can be significant
especially for multithreaded workloads doing small writes.
Signed-off-by: Yohan Joung <yohan.joung@sk.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The guidelines mention that firmware updates can't break the kernel,
but it doesn't state directly that they can't break userspace programs.
Make it explicit that firmware updates cannot break UAPI.
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Acked-by: Dave Airlie <airlied@redhat.com>
[jc: fixed "no trailing newline"]
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250314100137.2972355-1-jacek.lawrynowicz@linux.intel.com
|
|
Guest FPUs manage vCPU FPU states. They are allocated via
fpu_alloc_guest_fpstate() and are resized in fpstate_realloc() when XFD
features are enabled.
Since the introduction of guest FPUs, there have been inconsistencies in
the kernel buffer size and xfeatures:
1. fpu_alloc_guest_fpstate() uses fpu_user_cfg since its introduction. See:
69f6ed1d14c6 ("x86/fpu: Provide infrastructure for KVM FPU cleanup")
36487e6228c4 ("x86/fpu: Prepare guest FPU for dynamically enabled FPU features")
2. __fpstate_reset() references fpu_kernel_cfg to set storage attributes.
3. fpu->guest_perm uses fpu_kernel_cfg, affecting fpstate_realloc().
A recent commit in the tip:x86/fpu tree partially addressed the inconsistency
between (1) and (3) by using fpu_kernel_cfg for size calculation in (1),
but left fpu_guest->xfeatures and fpu_guest->perm still referencing
fpu_user_cfg:
https://lore.kernel.org/all/20250218141045.85201-1-stanspas@amazon.de/
1937e18cc3cf ("x86/fpu: Fix guest FPU state buffer allocation size")
The inconsistencies within fpu_alloc_guest_fpstate() and across the
mentioned functions cause confusion.
Fix them by using fpu_kernel_cfg consistently in fpu_alloc_guest_fpstate(),
except for fields related to the UABI buffer. Referencing fpu_kernel_cfg
won't impact functionalities, as:
1. fpu_guest->perm is overwritten shortly in fpu_init_guest_permissions()
with fpstate->guest_perm, which already uses fpu_kernel_cfg.
2. fpu_guest->xfeatures is solely used to check if XFD features are enabled.
Including supervisor xfeatures doesn't affect the check.
Fixes: 36487e6228c4 ("x86/fpu: Prepare guest FPU for dynamically enabled FPU features")
Suggested-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Link: https://lore.kernel.org/r/20250317140613.1761633-1-chao.gao@intel.com
|
|
Point out that explicit permission is usually needed to tag other people
in changes, but mention that implicit permission can be sufficient in
certain cases. This fixes slight inconsistencies between Reported-by:
and Suggested-by: and makes the usage more intuitive.
While at it, explicitly mention the dangers of our bugzilla instance, as
it makes it easy to forget that email addresses visible there are only
shown to logged-in users.
The latter is not a theoretical issue, as one maintainer mentioned that
his employer received a EU GDPR (general data protection regulation)
complaint after exposing a email address used in bugzilla through a tag
in a patch description.
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/588cf2763baa8fea1f4825f4eaa7023fe88bb6c1.1738852082.git.linux@leemhuis.info
|
|
The highuid.rst document describes a transition that is outdated and no
longer relevant. Additionally, it references filesystems (ncpfs and smbfs),
which have been removed or replaced.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Kang Taeho <kangtaeho2456@gmail.com>
Link: https://lore.kernel.org/r/20250313145650.278346-1-kangtaeho2456@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
|
The IBS software filter is filtering kernel samples for regular users in
the PMI handler. It checks the instruction address in the IBS register to
determine if it was in kernel mode or not.
But it turns out that it's possible to report a kernel data address even
if the instruction address belongs to user-space. Matteo Rizzo
found that when an instruction raises an exception, IBS can report some
kernel data addresses like IDT while holding the faulting instruction's
RIP. To prevent an information leak, it should double check if the data
address in PERF_SAMPLE_DATA is in the kernel space as well.
[ mingo: Clarified the changelog ]
Suggested-by: Matteo Rizzo <matteorizzo@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250317163755.1842589-1-namhyung@kernel.org
|
|
If @server->tcpStatus is set to CifsNeedReconnect after acquiring
@ses->session_mutex in smb2_reconnect() or cifs_reconnect_tcon(), it
means that a concurrent thread failed to negotiate, in which case the
server is no longer responding to any SMB requests, so there is no
point making the caller retry the IO by returning -EAGAIN.
Fix this by returning -EHOSTDOWN to the callers on soft mounts.
Cc: David Howells <dhowells@redhat.com>
Reported-by: Jay Shin <jaeshin@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC fixes from Arnd Bergmann:
"The majority of these last fixes are for devicetree files.
These address two important regressions for the Qualcomm SMMU and the
Raspberry Pi 4 USB controller, as well as a larger number of patches
fixing minor mistakes in board specific files for Rockchips, i.MX,
starfive and broadcom.
The non-DT changes are
- A fix for an old boot regression on Renesas shmobile chips
- Another boot time regression for for the Qualcomm PDR SoC driver,
among a few other Qualcomm firmware driver fixes for efivars and
tzmem
- Minor Kconfig fixes for davinci and OMAP1
- Minor code fixes for sparx5 reset controllers, OMAP memory
controller, i.MX SCU, cpufreq and SoC drivers and a Hisilicon SoC
driver
- One more update to the Asahi maintainers, adding Neal Gompa as a
reviewer"
* tag 'soc-fixes-6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (35 commits)
ARM: davinci: da850: fix selecting ARCH_DAVINCI_DA8XX
soc: hisilicon: kunpeng_hccs: Fix incorrect string assembly
memory: omap-gpmc: drop no compatible check
reset: mchp: sparx5: Fix for lan966x
ARM: shmobile: smp: Enforce shmobile_smp_* alignment
MAINTAINERS: Add myself (Neal Gompa) as a reviewer for ARM Apple support
MAINTAINERS: Add apple-spi driver & binding files
arm64: dts: rockchip: slow down emmc freq for rock 5 itx
ARM: dts: BCM5301X: Fix switch port labels of ASUS RT-AC3200
ARM: dts: BCM5301X: Fix switch port labels of ASUS RT-AC5300
ARM: dts: bcm2711: Don't mark timer regs unconfigured
ARM: OMAP1: select CONFIG_GENERIC_IRQ_CHIP
arm64: dts: rockchip: Add missing PCIe supplies to RockPro64 board dtsi
arm64: dts: rockchip: Add avdd HDMI supplies to RockPro64 board dtsi
arm64: dts: rockchip: Remove undocumented sdmmc property from lubancat-1
arm64: dts: rockchip: fix pinmux of UART5 for PX30 Ringneck on Haikou
arm64: dts: rockchip: fix pinmux of UART0 for PX30 Ringneck on Haikou
arm64: dts: rockchip: fix u2phy1_host status for NanoPi R4S
arm64: dts: bcm2712: PL011 UARTs are actually r1p5
ARM: dts: bcm2711: PL011 UARTs are actually r1p5
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- Clean up tprobe correctly when module unload
Tracepoint probes do not set TRACEPOINT_STUB on the 'tpoint' pointer
when unloading a module, thus they show as a normal 'fprobe' instead
of 'tprobe' and never come back
- Fix leakage of tprobe module refcount
When a tprobe's target module is loaded, it gets the module's
refcount in the module notifier but forgot to put it after
registering the probe on it.
Fix it by getting the refcount only when registering tprobe.
* tag 'probes-fixes-v6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: tprobe-events: Fix leakage of module refcount
tracing: tprobe-events: Fix to clean up tprobe correctly when module unload
|
|
Merge series from Takashi Iwai <tiwai@suse.de>:
This is a revised series of small and trivial patches to convert to
the newer PM macros, e.g. from SET_RUNTIME_PM_OPS() to
RUNTIME_PM_OPS().
The conversions are systematic, and we could reduce messy
__maybe_unused and ifdefs with those changes.
Merely code refactoring, and shouldn't change the actual driver
behavior.
|
|
Jiayuan Chen says:
====================
bpftool: Using the right format specifiers
This patch adds the -Wformat-signedness compiler flag to detect and
prevent format string errors, where signed or unsigned types are
mismatched with format specifiers. Additionally, it fixes some format
string errors that were not fully addressed by the previous patch [1].
[1] https://lore.kernel.org/bpf/20250207123706.727928-1-mrpre@163.com/T/#u
---
v1->v2:
https://lore.kernel.org/bpf/20250310142037.45932-1-jiayuan.chen@linux.dev/
---
====================
Link: https://patch.msgid.link/20250311112809.81901-1-jiayuan.chen@linux.dev
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
|
|
Fixed some formatting specifiers errors, such as using %d for int and %u
for unsigned int, as well as other byte-length types.
Perform type cast using the type derived from the data type itself, for
example, if it's originally an int, it will be cast to unsigned int if
forced to unsigned.
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250311112809.81901-3-jiayuan.chen@linux.dev
|
|
This commit adds the -Wformat-signedness compiler flag to detect and
prevent printf format errors, where signed or unsigned types are
mismatched with format specifiers. This helps to catch potential issues at
compile-time, ensuring that our code is more robust and reliable. With
this flag, the compiler will now warn about incorrect format strings, such
as using %d with unsigned types or %u with signed types.
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250311112809.81901-2-jiayuan.chen@linux.dev
|
|
Mykyta Yatsenko says:
====================
Support freplace prog from user namespace
From: Mykyta Yatsenko <yatsenko@meta.com>
Freplace programs can't be loaded from user namespace, as
bpf_program__set_attach_target() requires searching for target prog BTF,
which is locked under CAP_SYS_ADMIN.
This patch set enables this use case by:
1. Relaxing capable check in bpf's BPF_BTF_GET_FD_BY_ID, check for CAP_BPF
instead of CAP_SYS_ADMIN, support BPF token in attr argument.
2. Pass BPF token around libbpf from bpf_program__set_attach_target() to
bpf syscall where capable check is.
3. Validate positive/negative scenarios in selftests
This patch set is enabled by the recent libbpf change[1], that
introduced bpf_object__prepare() API. Calling bpf_object__prepare() for
freplace program before bpf_program__set_attach_target() initializes BPF
token, which is then passed to bpf syscall by libbpf.
[1] https://lore.kernel.org/all/20250303135752.158343-1-mykyta.yatsenko5@gmail.com/
====================
Link: https://patch.msgid.link/20250317174039.161275-1-mykyta.yatsenko5@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
|
|
Add selftests to verify that it is possible to load freplace program
from user namespace if BPF token is initialized by bpf_object__prepare
before calling bpf_program__set_attach_target.
Negative test is added as well.
Modified type of the priv_prog to xdp, as kprobe did not work on aarch64
and s390x.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20250317174039.161275-5-mykyta.yatsenko5@gmail.com
|
|
Pass BPF token from bpf_program__set_attach_target to
BPF_BTF_GET_FD_BY_ID bpf command.
When freplace program attaches to target program, it needs to look up
for BTF of the target, this may require BPF token, if, for example,
running from user namespace.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20250317174039.161275-4-mykyta.yatsenko5@gmail.com
|
|
Return prog's btf_id from bpf_prog_get_info_by_fd regardless of capable
check. This patch enables scenario, when freplace program, running
from user namespace, requires to query target prog's btf.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20250317174039.161275-3-mykyta.yatsenko5@gmail.com
|
|
Currently BPF_BTF_GET_FD_BY_ID requires CAP_SYS_ADMIN, which does not
allow running it from user namespace. This creates a problem when
freplace program running from user namespace needs to query target
program BTF.
This patch relaxes capable check from CAP_SYS_ADMIN to CAP_BPF and adds
support for BPF token that can be passed in attributes to syscall.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250317174039.161275-2-mykyta.yatsenko5@gmail.com
|
|
'net-stmmac-avoid-unnecessary-work-in-stmmac_release-stmmac_dvr_remove'
Russell King says:
====================
net: stmmac: avoid unnecessary work in stmmac_release()/stmmac_dvr_remove()
This small series is a subset of a RFC I sent earlier. These two
patches remove code that is unnecessary and/or wrong in these paths.
Details in each commit.
====================
Link: https://patch.msgid.link/Z87bpDd7QYYVU0ML@shell.armlinux.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
stmmac_release() calls phylink_stop() and then goes on to call
stmmac_mac_set(, false). However, phylink_stop() will call
stmmac_mac_link_down() before returning, which will do this work.
Remove this unnecessary call.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Furong Xu <0x1207@gmail.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1trcI6-005rn8-GV@rmk-PC.armlinux.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|