Age | Commit message (Collapse) | Author |
|
drm_mode_duplicate() could return NULL due to lack of memory,
which will then call NULL pointer dereference. Add a check to
prevent it.
Fixes: 0ef94554dc40 ("drm/panel: himax-hx83102: Break out as separate driver")
Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Reviewed-by: Neil Armstrong <neil.armstrong@linaro.org>
Link: https://lore.kernel.org/r/20241025073408.27481-3-zhangzekun11@huawei.com
Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20241025073408.27481-3-zhangzekun11@huawei.com
|
|
Instead of jumping to the Xen hypercall page for doing the iret
hypercall, directly code the required sequence in xen-asm.S.
This is done in preparation of no longer using hypercall page at all,
as it has shown to cause problems with speculation mitigations.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
|
|
Add static_call_update_early() for updating static-call targets in
very early boot.
This will be needed for support of Xen guest type specific hypercall
functions.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
|
|
The syscall instruction is used in Xen PV mode for doing hypercalls.
Allow syscall to be used in the kernel in case it is tagged with an
unwind hint for objtool.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
|
|
In order to be able to differentiate between AMD and Intel based
systems for very early hypercalls without having to rely on the Xen
hypercall page, make get_cpu_vendor() non-static.
Refactor early_cpu_init() for the same reason by splitting out the
loop initializing cpu_devs() into an externally callable function.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
When removing a netfront device directly after a suspend/resume cycle
it might happen that the queues have not been setup again, causing a
crash during the attempt to stop the queues another time.
Fix that by checking the queues are existing before trying to stop
them.
This is XSA-465 / CVE-2024-53240.
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Fixes: d50b7914fae0 ("xen-netfront: Fix NULL sring after live migration")
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
esre_attribute::store() is not needed since commit af97a77bc01c (efi:
Move some sysfs files to be read-only by root). Drop it.
Found by https://github.com/jirislaby/clang-struct.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-efi@vger.kernel.org
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into next-rc
xfs: bug fixes for 6.13 [01/12]
Bug fixes for 6.13.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Mina Almasry says:
====================
devmem TCP fixes
Couple unrelated devmem TCP fixes bundled in a series for some
convenience.
- fix naming and provide page_pool_alloc_netmem for fragged
netmem.
- fix issues with dma-buf dma addresses being potentially
passed to dma_sync_for_* helpers.
====================
Link: https://patch.msgid.link/20241211212033.1684197-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
dmabuf dma-addresses should not be dma_sync'd for CPU/device. Typically
its the driver responsibility to dma_sync for CPU, but the driver should
not dma_sync for CPU if the netmem is actually coming from a dmabuf
memory provider.
The page_pool already exposes a helper for dma_sync_for_cpu:
page_pool_dma_sync_for_cpu. Upgrade this existing helper to handle
netmem, and have it skip dma_sync if the memory is from a dmabuf memory
provider. Drivers should migrate to using this helper when adding
support for netmem.
Also minimize the impact on the dma syncing performance for pages. Special
case the dma-sync path for pages to not go through the overhead checks
for dma-syncing and conversion to netmem.
Cc: Alexander Lobakin <aleksander.lobakin@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20241211212033.1684197-5-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Move the `dma_map` and `dma_sync` checks to `page_pool_init` to make
them generic. Set dma_sync to false for devmem memory provider because
the dma_sync APIs should not be used for dma_buf backed devmem memory
provider.
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20241211212033.1684197-4-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Create page_pool_alloc_netmem to be the mirror of page_pool_alloc.
This enables drivers that want currently use page_pool_alloc to
transition to netmem by converting the call sites to
page_pool_alloc_netmem.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241211212033.1684197-3-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
page_pool_alloc_netmem (without an s) was the mirror of
page_pool_alloc_pages (with an s), which was confusing.
Rename to page_pool_alloc_netmems so it's the mirror of
page_pool_alloc_pages.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241211212033.1684197-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Alexander Lobakin says:
====================
xdp: a fistful of generic changes pt. II (part)
XDP for idpf is currently 5.5 chapters:
* convert Rx to libeth;
* convert Tx and stats to libeth;
* generic XDP and XSk code changes;
* generic XDP and XSk code additions (you are here);
* actual XDP for idpf via new libeth_xdp;
* XSk for idpf (via ^).
Part III.2.1 does the following:
* allows mixing pages from several Page Pools within one XDP frame;
* optimizes &xdp_frame structure and removes no-more-used field;
Everything is prereq for libeth_xdp, but will be useful standalone
as well: faster xdp_return_frame_bulk() and xdp_frame fields access.
====================
Link: https://patch.msgid.link/20241211172649.761483-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
skb_frag_dma_map(dev, frag, 0, skb_frag_size(frag), DMA_TO_DEVICE)
is repeated across dozens of drivers and really wants a shorthand.
Add a macro which will count args and handle all possible number
from 2 to 5. Semantics:
skb_frag_dma_map(dev, frag) ->
__skb_frag_dma_map(dev, frag, 0, skb_frag_size(frag), DMA_TO_DEVICE)
skb_frag_dma_map(dev, frag, offset) ->
__skb_frag_dma_map(dev, frag, offset, skb_frag_size(frag) - offset,
DMA_TO_DEVICE)
skb_frag_dma_map(dev, frag, offset, size) ->
__skb_frag_dma_map(dev, frag, offset, size, DMA_TO_DEVICE)
skb_frag_dma_map(dev, frag, offset, size, dir) ->
__skb_frag_dma_map(dev, frag, offset, size, dir)
No object code size changes for the existing callers. Users passing
less arguments also won't have bigger size comparing to the full
equivalent call.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241211172649.761483-11-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, __xdp_return() takes pointer to the virtual memory to free
a buffer. Apart from that this sometimes provokes redundant
data <--> page conversions, taking data pointer effectively prevents
lots of XDP code to support non-page-backed buffers, as there's no
mapping for the non-host memory (data is always NULL).
Just convert it to always take netmem reference. For
xdp_return_{buff,frame*}(), this chops off one page_address() per each
frag and adds one virt_to_netmem() (same as virt_to_page()) per header
buffer. For __xdp_return() itself, it removes one virt_to_page() for
MEM_TYPE_PAGE_POOL and another one for MEM_TYPE_PAGE_ORDER0, adding
one page_address() for [not really common nowadays]
MEM_TYPE_PAGE_SHARED, but the main effect is that the abovementioned
functions won't die or memleak anymore if the frame has non-host memory
attached and will correctly free those.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241211172649.761483-4-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Initially, xdp_frame::mem.id was used to search for the corresponding
&page_pool to return the page correctly.
However, after that struct page was extended to have a direct pointer
to its PP (netmem has it as well), further keeping of this field makes
no sense. xdp_return_frame_bulk() still used it to do a lookup, and
this leftover is now removed.
Remove xdp_frame::mem and replace it with ::mem_type, as only memory
type still matters and we need to know it to be able to free the frame
correctly.
As a cute side effect, we can now make every scalar field in &xdp_frame
of 4 byte width, speeding up accesses to them.
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241211172649.761483-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The main reason for this change was to allow mixing pages from different
&page_pools within one &xdp_buff/&xdp_frame. Why not? With stuff like
devmem and io_uring zerocopy Rx, it's required to have separate PPs for
header buffers and payload buffers.
Adjust xdp_return_frame_bulk() and page_pool_put_netmem_bulk(), so that
they won't be tied to a particular pool. Let the latter create a
separate bulk of pages which's PP is different from the first netmem of
the bulk and process it after the main loop.
This greatly optimizes xdp_return_frame_bulk(): no more hashtable
lookups and forced flushes on PP mismatch. Also make
xdp_flush_frame_bulk() inline, as it's just one if + function call + one
u32 read, not worth extending the call ladder.
Co-developed-by: Toke Høiland-Jørgensen <toke@redhat.com> # iterative
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org> # while (count)
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241211172649.761483-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Take advantage of the multigrain timestamp APIs to ensure that nobody
can sneak in and write things to a file between starting a file update
operation and committing the results. This should have been part of the
multigrain timestamp merge, but I forgot to fling it at jlayton when he
resubmitted the patchset due to developer bandwidth problems.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 4e40eff0b5737c ("fs: add infrastructure for multigrain timestamps")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
|
|
V4 symlink blocks didn't have headers, so return early if this is a V4
filesystem.
Cc: <stable@vger.kernel.org> # v5.1
Fixes: 39708c20ab5133 ("xfs: miscellaneous verifier magic value fixups")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
The logic to check that the region past the end of the superblock is all
zeroes is wrong -- we don't want to check only the bytes past the end of
the maximally sized ondisk superblock structure as currently defined in
xfs_format.h; we want to check the bytes beyond the end of the ondisk as
defined by the feature bits.
Port the superblock size logic from xfs_repair and then put it to use in
xfs_scrub.
Cc: <stable@vger.kernel.org> # v4.15
Fixes: 21fb4cb1981ef7 ("xfs: scrub the secondary superblocks")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
The checks that were added to the superblock scrubber for metadata
directories aren't quite right -- the old inode pointers are now defined
to be zeroes until someone else reuses them. Also consolidate the new
metadir field checks to one place; they were inexplicably scattered
around.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 28d756d4d562dc ("xfs: update sb field checks when metadir is turned on")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
If the /quotas dirent points to an inode but the inode isn't loadable
(and hence mkdir returns -EEXIST), don't crash, just bail out.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: e80fbe1ad8eff7 ("xfs: use metadir for quota inodes")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Only directories or regular files are allowed in the metadata directory
tree. Don't move the repair tempfile to the metadir namespace if this
is not true; this will cause the inode verifiers to trip.
xrep_tempfile_adjust_directory_tree opportunistically moves sc->tempip
from the regular directory tree to the metadata directory tree if sc->ip
is part of the metadata directory tree. However, the scrub setup
functions grab sc->ip and create sc->tempip before we actually get
around to checking if the file mode is the right type for the scrubber.
IOWs, you can invoke the symlink scrubber with the file handle of a
subdirectory in the metadir. xrep_setup_symlink will create a temporary
symlink file, xrep_tempfile_adjust_directory_tree will foolishly try to
set the METADATA flag on the temp symlink, which trips the inode
verifier in the inode item precommit, which shuts down the filesystem
when expensive checks are turned on. If they're /not/ turned on, then
xchk_symlink will return ENOENT when it sees that it's been passed a
symlink, but the invalid inode could still get flushed to disk. We
don't want that.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 9dc31acb01a1c7 ("xfs: move repair temporary files to the metadata directory tree")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
For a sparse inodes filesystem, mkfs.xfs computes the values of
sb_spino_align and sb_inoalignmt with the following code:
int cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
if (cfg->sb_feat.crcs_enabled)
cluster_size *= cfg->inodesize / XFS_DINODE_MIN_SIZE;
sbp->sb_spino_align = cluster_size >> cfg->blocklog;
sbp->sb_inoalignmt = XFS_INODES_PER_CHUNK *
cfg->inodesize >> cfg->blocklog;
On a V5 filesystem with 64k fsblocks and 512 byte inodes, this results
in cluster_size = 8192 * (512 / 256) = 16384. As a result,
sb_spino_align and sb_inoalignmt are both set to zero. Unfortunately,
this trips the new sb_spino_align check that was just added to
xfs_validate_sb_common, and the mkfs fails:
# mkfs.xfs -f -b size=64k, /dev/sda
meta-data=/dev/sda isize=512 agcount=4, agsize=81136 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
= exchange=0 metadir=0
data = bsize=65536 blocks=324544, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=65536 ascii-ci=0, ftype=1, parent=0
log =internal log bsize=65536 blocks=5006, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=65536 blocks=0, rtextents=0
= rgcount=0 rgsize=0 extents
Discarding blocks...Sparse inode alignment (0) is invalid.
Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
Sparse inode alignment (0) is invalid.
Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1
mkfs.xfs: writing AG headers failed, err=22
Prior to commit 59e43f5479cce1 this all worked fine, even if "sparse"
inodes are somewhat meaningless when everything fits in a single
fsblock. Adjust the checks to handle existing filesystems.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 59e43f5479cce1 ("xfs: sb_spino_align is not verified")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Now that we've converted the dquot logging machinery to attach the dquot
buffer to the li_buf pointer so that the AIL dqflush doesn't have to
allocate or read buffers in a reclaim path, do the same for the
quotacheck code so that the reclaim shrinker dqflush call doesn't have
to do that either.
Cc: <stable@vger.kernel.org> # v6.12
Fixes: 903edea6c53f09 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Ever since 6.12-rc1, I've observed a pile of warnings from the kernel
when running fstests with quotas enabled:
WARNING: CPU: 1 PID: 458580 at mm/page_alloc.c:4221 __alloc_pages_noprof+0xc9c/0xf18
CPU: 1 UID: 0 PID: 458580 Comm: xfsaild/sda3 Tainted: G W 6.12.0-rc6-djwa #rc6 6ee3e0e531f6457e2d26aa008a3b65ff184b377c
<snip>
Call trace:
__alloc_pages_noprof+0xc9c/0xf18
alloc_pages_mpol_noprof+0x94/0x240
alloc_pages_noprof+0x68/0xf8
new_slab+0x3e0/0x568
___slab_alloc+0x5a0/0xb88
__slab_alloc.constprop.0+0x7c/0xf8
__kmalloc_noprof+0x404/0x4d0
xfs_buf_get_map+0x594/0xde0 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfs_buf_read_map+0x64/0x2e0 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfs_trans_read_buf_map+0x1dc/0x518 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfs_qm_dqflush+0xac/0x468 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfs_qm_dquot_logitem_push+0xe4/0x148 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfsaild+0x3f4/0xde8 [xfs 384cb02810558b4c490343c164e9407332118f88]
kthread+0x110/0x128
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
This corresponds to the line:
WARN_ON_ONCE(current->flags & PF_MEMALLOC);
within the NOFAIL checks. What's happening here is that the XFS AIL is
trying to write a disk quota update back into the filesystem, but for
that it needs to read the ondisk buffer for the dquot. The buffer is
not in memory anymore, probably because it was evicted. Regardless, the
buffer cache tries to allocate a new buffer, but those allocations are
NOFAIL. The AIL thread has marked itself PF_MEMALLOC (aka noreclaim)
since commit 43ff2122e6492b ("xfs: on-stack delayed write buffer lists")
presumably because reclaim can push on XFS to push on the AIL.
An easy way to fix this probably would have been to drop the NOFAIL flag
from the xfs_buf allocation and open code a retry loop, but then there's
still the problem that for bs>ps filesystems, the buffer itself could
require up to 64k worth of pages.
Inode items had similar behavior (multi-page cluster buffers that we
don't want to allocate in the AIL) which we solved by making transaction
precommit attach the inode cluster buffers to the dirty log item. Let's
solve the dquot problem in the same way.
So: Make a real precommit handler to read the dquot buffer and attach it
to the log item; pass it to dqflush in the push method; and have the
iodone function detach the buffer once we've flushed everything. Add a
state flag to the log item to track when a thread has entered the
precommit -> push mechanism to skip the detaching if it turns out that
the dquot is very busy, as we don't hold the dquot lock between log item
commit and AIL push).
Reading and attaching the dquot buffer in the precommit hook is inspired
by the work done for inode cluster buffers some time ago.
Cc: <stable@vger.kernel.org> # v6.12
Fixes: 903edea6c53f09 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Clean up these functions a little bit before we move on to the real
modifications, and make the variable naming consistent for dquot log
items.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
The first step towards holding the dquot buffer in the li_buf instead of
reading it in the AIL is to separate the part that reads the buffer from
the actual flush code. There should be no functional changes.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Quota counter updates are tracked via incore objects which hang off the
xfs_trans object. These changes are then turned into dirty log items in
xfs_trans_apply_dquot_deltas just prior to commiting the log items to
the CIL.
However, updating the incore deltas do not cause XFS_TRANS_DIRTY to be
set on the transaction. In other words, a pure quota counter update
will be silently discarded if there are no other dirty log items
attached to the transaction.
This is currently not the case anywhere in the filesystem because quota
updates always dirty at least one other metadata item, but a subsequent
bug fix will add dquot log item precommits, so we actually need a dirty
dquot log item prior to xfs_trans_run_precommits. Also let's not leave
a logic bomb.
Cc: <stable@vger.kernel.org> # v2.6.35
Fixes: 0924378a689ccb ("xfs: split out iclog writing from xfs_trans_commit()")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Superblock counter updates are tracked via per-transaction counters in
the xfs_trans object. These changes are then turned into dirty log
items in xfs_trans_apply_sb_deltas just prior to commiting the log items
to the CIL.
However, updating the per-transaction counter deltas do not cause
XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure sb
counter update will be silently discarded if there are no other dirty
log items attached to the transaction.
This is currently not the case anywhere in the filesystem because sb
counter updates always dirty at least one other metadata item, but let's
not leave a logic bomb.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Currently, __xfs_trans_commit calls xfs_defer_finish_noroll, which calls
__xfs_trans_commit again on the same transaction. In other words,
there's a nested function call (albeit with slightly different
arguments) that has caused minor amounts of confusion in the past.
There's no reason to keep this around, since there's only one place
where we actually want the xfs_defer_finish_noroll, and that is in the
top level xfs_trans_commit call.
This also reduces stack usage a little bit.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Committing a transaction tx0 with a defer ops chain of (A, B, C)
creates a chain of transactions that looks like this:
tx0 -> txA -> txB -> txC
Prior to commit cb042117488dbf, __xfs_trans_commit would run precommits
on tx0, then call xfs_defer_finish_noroll to convert A-C to tx[A-C].
Unfortunately, after the finish_noroll loop we forgot to run precommits
on txC. That was fixed by adding the second precommit call.
Unfortunately, none of us remembered that xfs_defer_finish_noroll
calls __xfs_trans_commit a second time to commit tx0 before finishing
work A in txA and committing that. In other words, we run precommits
twice on tx0:
xfs_trans_commit(tx0)
__xfs_trans_commit(tx0, false)
xfs_trans_run_precommits(tx0)
xfs_defer_finish_noroll(tx0)
xfs_trans_roll(tx0)
txA = xfs_trans_dup(tx0)
__xfs_trans_commit(tx0, true)
xfs_trans_run_precommits(tx0)
This currently isn't an issue because the inode item precommit is
idempotent; the iunlink item precommit deletes itself so it can't be
called again; and the buffer/dquot item precommits only check the incore
objects for corruption. However, it doesn't make sense to run
precommits twice.
Fix this situation by only running precommits after finish_noroll.
Cc: <stable@vger.kernel.org> # v6.4
Fixes: cb042117488dbf ("xfs: defered work could create precommits")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Debugging a filesystem patch with generic/475 caused the system to hang
after observing the following sequences in dmesg:
XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x491520 len 32 error 5
XFS (dm-0): metadata I/O error in "xfs_btree_read_buf_block+0xba/0x160 [xfs]" at daddr 0x3445608 len 8 error 5
XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x138e1c0 len 32 error 5
XFS (dm-0): log I/O error -5
XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x1ea/0x4b0 [xfs] (fs/xfs/xfs_trans_buf.c:311). Shutting down filesystem.
XFS (dm-0): Please unmount the filesystem and rectify the problem(s)
XFS (dm-0): Internal error dqp->q_ino.reserved < dqp->q_ino.count at line 869 of file fs/xfs/xfs_trans_dquot.c. Caller xfs_trans_dqresv+0x236/0x440 [xfs]
XFS (dm-0): Corruption detected. Unmount and run xfs_repair
XFS (dm-0): Unmounting Filesystem be6bcbcc-9921-4deb-8d16-7cc94e335fa7
The system is stuck in unmount trying to lock a couple of inodes so that
they can be purged. The dquot corruption notice above is a clue to what
happened -- a link() call tried to set up a transaction to link a child
into a directory. Quota reservation for the transaction failed after IO
errors shut down the filesystem, but then we forgot to unlock the inodes
on our way out. Fix that.
Cc: <stable@vger.kernel.org> # v6.10
Fixes: bd5562111d5839 ("xfs: Hold inode locks in xfs_trans_alloc_dir")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Fix a minor mistakes in the scrub tracepoints that can manifest when
inode-rooted btrees are enabled. The existing code worked fine for bmap
btrees, but we should tighten the code up to be less sloppy.
Cc: <stable@vger.kernel.org> # v5.7
Fixes: 92219c292af8dd ("xfs: convert btree cursor inode-private member names")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
In commit 2c813ad66a72, I partially fixed a bug wherein xfs_btree_insrec
would erroneously try to update the parent's key for a block that had
been split if we decided to insert the new record into the new block.
The solution was to detect this situation and update the in-core key
value that we pass up to the caller so that the caller will (eventually)
add the new block to the parent level of the tree with the correct key.
However, I missed a subtlety about the way inode-rooted btrees work. If
the full block was a maximally sized inode root block, we'll solve that
fullness by moving the root block's records to a new block, resizing the
root block, and updating the root to point to the new block. We don't
pass a pointer to the new block to the caller because that work has
already been done. The new record will /always/ land in the new block,
so in this case we need to use xfs_btree_update_keys to update the keys.
This bug can theoretically manifest itself in the very rare case that we
split a bmbt root block and the new record lands in the very first slot
of the new block, though I've never managed to trigger it in practice.
However, it is very easy to reproduce by running generic/522 with the
realtime rmapbt patchset if rtinherit=1.
Cc: <stable@vger.kernel.org> # v4.8
Fixes: 2c813ad66a7218 ("xfs: support btrees with overlapping intervals for keys")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
smatch reported that we screwed up the error cleanup in this function.
Fix it.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: ae897e0bed0f54 ("xfs: support creating per-RTG files in growfs")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
xfs_bmap_rtalloc initializes the bno_hint variable to NULLRTBLOCK (aka
NULLFSBLOCK). If the allocation request is for a file range that's
adjacent to an existing mapping, it will then change bno_hint to the
blkno hint in the bmalloca structure.
In other words, bno_hint is either a rt block number, or it's all 1s.
Unfortunately, commit ec12f97f1b8a8f didn't take the NULLRTBLOCK state
into account, which means that it tries to translate that into a
realtime extent number. We then end up with an obnoxiously high rtx
number and pointlessly feed that to the near allocator. This often
fails and falls back to the by-size allocator. Seeing as we had no
locality hint anyway, this is a waste of time.
Fix the code to detect a lack of bno_hint correctly. This was detected
by running xfs/009 with metadir enabled and a 28k rt extent size.
Cc: <stable@vger.kernel.org> # v6.12
Fixes: ec12f97f1b8a8f ("xfs: make the rtalloc start hint a xfs_rtblock_t")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Once in a long while, xfs/566 and xfs/801 report directory corruption in
one of the metadata subdirectories while it's forcibly rebuilding all
filesystem metadata. I observed the following sequence of events:
1. Initiate a repair of the parent pointers for the /quota/user file.
This is the secret file containing user quota data.
2. The pptr repair thread creates a temporary file and begins staging
parent pointers in the ondisk metadata in preparation for an
exchange-range to commit the new pptr data.
3. At the same time, initiate a repair of the /quota directory itself.
4. The dir repair thread finds the temporary file from (2), scans it for
parent pointers, and stages a dirent in its own temporary dir in
preparation to commit the fixed directory.
5. The parent pointer repair completes and frees the temporary file.
6. The dir repair commits the new directory and scans it again. It
finds the dirent that points to the old temporary file in (2) and
marks the directory corrupt.
Oops! Repair code must never scan the temporary files that other repair
functions create to stage new metadata. They're not supposed to do
that, but the predicate function xrep_is_tempfile is incorrect because
it assumes that any XFS_DIFLAG2_METADATA file cannot ever be a temporary
file, but xrep_tempfile_adjust_directory_tree creates exactly that.
Fix this by setting the IRECOVERY flag on temporary metadata directory
inodes and using that to correct the predicate. Repair code is supposed
to erase all the data in temporary files before releasing them, so it's
ok if a thread scans the temporary file after we drop IRECOVERY.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: bb6cdd5529ff67 ("xfs: hide metadata inodes from everyone because they are special")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
If we need to reset a symlink target to the "durr it's busted" string,
then we clear the zapped flag as well. However, this should be using
the provided helper so that we don't set the zapped state on an
otherwise ok symlink.
Cc: <stable@vger.kernel.org> # v6.10
Fixes: 2651923d8d8db0 ("xfs: online repair of symbolic links")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
In commit d9041681dd2f53 we introduced some XFS_SICK_*ZAPPED flags so
that the inode record repair code could clean up a damaged inode record
enough to iget the inode but still be able to remember that the higher
level repair code needs to be called. As part of that, we introduced a
xchk_mark_healthy_if_clean helper that is supposed to cause the ZAPPED
state to be removed if that higher level metadata actually checks out.
This was done by setting additional bits in sick_mask hoping that
xchk_update_health will clear all those bits after a healthy scrub.
Unfortunately, that's not quite what sick_mask means -- bits in that
mask are indeed cleared if the metadata is healthy, but they're set if
the metadata is NOT healthy. fsck is only intended to set the ZAPPED
bits explicitly.
If something else sets the CORRUPT/XCORRUPT state after the
xchk_mark_healthy_if_clean call, we end up marking the metadata zapped.
This can happen if the following sequence happens:
1. Scrub runs, discovers that the metadata is fine but could be
optimized and calls xchk_mark_healthy_if_clean on a ZAPPED flag.
That causes the ZAPPED flag to be set in sick_mask because the
metadata is not CORRUPT or XCORRUPT.
2. Repair runs to optimize the metadata.
3. Some other metadata used for cross-referencing in (1) becomes
corrupt.
4. Post-repair scrub runs, but this time it sets CORRUPT or XCORRUPT due
to the events in (3).
5. Now the xchk_health_update sets the ZAPPED flag on the metadata we
just repaired. This is not the correct state.
Fix this by moving the "if healthy" mask to a separate field, and only
ever using it to clear the sick state.
Cc: <stable@vger.kernel.org> # v6.8
Fixes: d9041681dd2f53 ("xfs: set inode sick state flags when we zap either ondisk fork")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Way back when we first implemented FICLONE for XFS, life was simple --
either the the entire remapping completed, or something happened and we
had to return an errno explaining what happened. Neither of those
ioctls support returning partial results, so it's all or nothing.
Then things got complicated when copy_file_range came along, because it
actually can return the number of bytes copied, so commit 3f68c1f562f1e4
tried to make it so that we could return a partial result if the
REMAP_FILE_CAN_SHORTEN flag is set. This is also how FIDEDUPERANGE can
indicate that the kernel performed a partial deduplication.
Unfortunately, the logic is wrong if an error stops the remapping and
CAN_SHORTEN is not set. Because those callers cannot return partial
results, it is an error for ->remap_file_range to return a positive
quantity that is less than the @len passed in. Implementations really
should be returning a negative errno in this case, because that's what
btrfs (which introduced FICLONE{,RANGE}) did.
Therefore, ->remap_range implementations cannot silently drop an errno
that they might have when the number of bytes remapped is less than the
number of bytes requested and CAN_SHORTEN is not set.
Found by running generic/562 on a 64k fsblock filesystem and wondering
why it reported corrupt files.
Cc: <stable@vger.kernel.org> # v4.20
Fixes: 3fc9f5e409319e ("xfs: remove xfs_reflink_remap_range")
Really-Fixes: 3f68c1f562f1e4 ("xfs: support returning partial reflink results")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
With the nrext64 feature enabled, it's possible for a data fork to have
2^48 extent mappings. Even with a 64k fsblock size, that maps out to
a bmbt containing more than 2^32 blocks. Therefore, this predicate must
return a u64 count to avoid an integer wraparound that will cause scrub
to do the wrong thing.
It's unlikely that any such filesystem currently exists, because the
incore bmbt would consume more than 64GB of kernel memory on its own,
and so far nobody except me has driven a filesystem that far, judging
from the lack of complaints.
Cc: <stable@vger.kernel.org> # v5.19
Fixes: df9ad5cc7a5240 ("xfs: Introduce macros to represent new maximum extent counts for data/attr forks")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
In the same vein as the previous patch, there's no point in the metapath
scrub setup function doing a lookup on the quota metadir just so it can
validate that lookups work correctly. Instead, retain the quota
directory inode in memory for the lifetime of the mount so that we can
check this meaningfully.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 128a055291ebbc ("xfs: scrub quota file metapaths")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Don't waste time in xchk_setup_metapath_dqinode doing a second lookup of
the quota inodes, just grab them from the quotainfo structure. The
whole point of this scrubber is to make sure that the dirents exist, so
it's completely silly to do lookups.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 128a055291ebbc ("xfs: scrub quota file metapaths")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
In commit ca6448aed4f10a, we created an "end_daddr" variable to fix
fsmap reporting when the end of the range requested falls in the middle
of an unknown (aka free on the rmapbt) region. Unfortunately, I didn't
notice that the the code sets end_daddr to the last sector of the device
but then uses that quantity to compute the length of the synthesized
mapping.
Zizhi Wo later observed that when end_daddr isn't set, we still don't
report the last fsblock on a device because in that case (aka when
info->last is true), the info->high mapping that we pass to
xfs_getfsmap_group_helper has a startblock that points to the last
fsblock. This is also wrong because the code uses startblock to
compute the length of the synthesized mapping.
Fix the second problem by setting end_daddr unconditionally, and fix the
first problem by setting start_daddr to one past the end of the range to
query.
Cc: <stable@vger.kernel.org> # v6.11
Fixes: ca6448aed4f10a ("xfs: Fix missing interval for missing_owner in xfs fsmap")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reported-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Pull smb server fixes from Steve French:
- fix ctime setting in setattr
- fix reference count on user session to avoid potential race with
session expire
- fix query dir issue
* tag 'v6.13-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: set ATTR_CTIME flags when setting mtime
ksmbd: fix racy issue from session lookup and expire
ksmbd: retry iterate_dir in smb2_query_dir
|
|
Add three qdisc-specific drop reasons and use them in sch_cake:
1) SKB_DROP_REASON_QDISC_OVERLIMIT
Whenever the total queue limit for a qdisc instance is exceeded
and a packet is dropped to make room.
2) SKB_DROP_REASON_QDISC_CONGESTED
Whenever a packet is dropped by the qdisc AQM algorithm because
congestion is detected.
3) SKB_DROP_REASON_CAKE_FLOOD
Whenever a packet is dropped by the flood protection part of the
CAKE AQM algorithm (BLUE).
Also use the existing SKB_DROP_REASON_QUEUE_PURGE in cake_clear_tin().
Reasons show up as:
perf record -a -e skb:kfree_skb sleep 1; perf script
iperf3 665 [005] 848.656964: skb:kfree_skb: skbaddr=0xffff98168a333500 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x10f0 reason: QDISC_OVERLIMIT
swapper 0 [001] 909.166055: skb:kfree_skb: skbaddr=0xffff98168280cee0 rx_sk=(nil) protocol=34525 location=cake_dequeue+0x5ef reason: QDISC_CONGESTED
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20241211-cake-drop-reason-v2-1-920afadf4d1b@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools fixes from Namhyung Kim:
"A set of random fixes for this cycle.
perf record:
- Fix build-id event size calculation in perf record
- Fix perf record -C/--cpu option on hybrid systems
- Fix perf mem record with precise-ip on SapphireRapids
perf test:
- Refresh hwmon directory before reading the test files
- Make sure system_tsc_freq event is tested on x86 only
Others:
- Usual header file sync
- Fix undefined behavior in perf ftrace profile
- Properly initialize a return variable in perf probe"
* tag 'perf-tools-fixes-for-v6.13-2024-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (21 commits)
perf probe: Fix uninitialized variable
libperf: evlist: Fix --cpu argument on hybrid platform
perf test expr: Fix system_tsc_freq for only x86
perf test hwmon_pmu: Fix event file location
perf hwmon_pmu: Use openat rather than dup to refresh directory
perf ftrace: Fix undefined behavior in cmp_profile_data()
perf tools: Fix precise_ip fallback logic
perf tools: Fix build error on generated/fs_at_flags_array.c
tools headers: Sync uapi/linux/prctl.h with the kernel sources
tools headers: Sync uapi/linux/mount.h with the kernel sources
tools headers: Sync uapi/linux/fcntl.h with the kernel sources
tools headers: Sync uapi/asm-generic/mman.h with the kernel sources
tools headers: Sync *xattrat syscall changes with the kernel sources
tools headers: Sync arm64 kvm header with the kernel sources
tools headers: Sync x86 kvm and cpufeature headers with the kernel
tools headers: Sync uapi/linux/kvm.h with the kernel sources
tools headers: Sync uapi/linux/perf_event.h with the kernel sources
tools headers: Sync uapi/drm/drm.h with the kernel sources
perf machine: Initialize machine->env to address a segfault
perf test: Don't signal all processes on system when interrupting tests
...
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.13-2024-12-11:
amdgpu:
- ISP hw init fix
- SR-IOV fixes
- Fix contiguous VRAM mapping for UVD on older GPUs
- Fix some regressions due to drm scheduler changes
- Workload profile fixes
- Cleaner shader fix
amdkfd:
- Fix DMA map direction for migration
- Fix a potential null pointer dereference
- Cacheline size fixes
- Runtime PM fix
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241211215449.741848-1-alexander.deucher@amd.com
|