summaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)Author
2024-12-23xfs: wire up getfsmap to the realtime reverse mapping btreeDarrick J. Wong
Connect the getfsmap ioctl to the realtime rmapbt. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: create routine to allocate and initialize a realtime rmap btree inodeDarrick J. Wong
Create a library routine to allocate and initialize an empty realtime rmapbt inode. We'll use this for mkfs and repair. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: wire up rmap map and unmap to the realtime rmapbtDarrick J. Wong
Connect the map and unmap reverse-mapping operations to the realtime rmapbt via the deferred operation callbacks. This enables us to perform rmap operations against the correct btree. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: wire up a new metafile type for the realtime rmapDarrick J. Wong
Plumb in the pieces we need to embed the root of the realtime rmap btree in an inode's data fork, complete with new metafile type and on-disk interpretation functions. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: add metadata reservations for realtime rmap btreesDarrick J. Wong
Reserve some free blocks so that we will always have enough free blocks in the data volume to handle expansion of the realtime rmap btree. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: add realtime reverse map inode to metadata directoryDarrick J. Wong
Add a metadir path to select the realtime rmap btree inode and load it at mount time. The rtrmapbt inode will have a unique extent format code, which means that we also have to update the inode validation and flush routines to look for it. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: support file data forks containing metadata btreesDarrick J. Wong
Create a new fork format type for metadata btrees. This fork type requires that the inode is in the metadata directory tree, and only applies to the data fork. The actual type of the metadata btree itself is determined by the di_metatype field. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: pretty print metadata file types in error messagesDarrick J. Wong
Create a helper function to turn a metadata file type code into a printable string, and use this to complain about lockdep problems with rtgroup inodes. We'll use this more in the next patch. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: support recovering rmap intent items targetting realtime extentsDarrick J. Wong
Now that we have rmap on the realtime device and rmap intent items that target the realtime device, log recovery has to support remapping extents on the realtime volume. Make this work. Identify rtrmapbt blocks in the log correctly so that we can validate them during log recovery. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: add a realtime flag to the rmap update log redo itemsDarrick J. Wong
Extend the rmap update (RUI) log items to handle realtime volumes by adding a new log intent item type. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: prepare rmap functions to deal with rtrmapbtDarrick J. Wong
Prepare the high-level rmap functions to deal with the new realtime rmapbt and its slightly different conventions. Provide the ability to talk to either rmapbt or rtrmapbt formats from the same high level code. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: add realtime rmap btree operationsDarrick J. Wong
Implement the generic btree operations needed to manipulate rtrmap btree blocks. This is different from the regular rmapbt in that we allocate space from the filesystem at large, and are neither constrained to the free space nor any particular AG. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: realtime rmap btree transaction reservationsDarrick J. Wong
Make sure that there's enough log reservation to handle mapping and unmapping realtime extents. We have to reserve enough space to handle a split in the rtrmapbt to add the record and a second split in the regular rmapbt to record the rtrmapbt split. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: introduce realtime rmap btree ondisk definitionsDarrick J. Wong
Add the ondisk structure definitions for realtime rmap btrees. The realtime rmap btree will be rooted from a hidden inode so it needs to have a separate btree block magic and pointer format. Next, add everything needed to read, write and manipulate rmap btree blocks. This prepares the way for connecting the btree operations implementation, though embedding the rtrmap btree root in the inode comes later in the series. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: allow inode-based btrees to reserve space in the data deviceDarrick J. Wong
Create a new space reservation scheme so that btree metadata for the realtime volume can reserve space in the data device to avoid space underruns. Back when we were testing the rmap and refcount btrees for the data device, people observed occasional shutdowns when xfs_btree_split was called for either of those two btrees. This happened when certain operations (mostly writeback ioends) created new rmap or refcount records, which would expand the size of the btree. If there were no free blocks available the allocation would fail and the split would shut down the filesystem. I considered pre-reserving blocks for btree expansion at the time of a write() call, but there wasn't any good way to attach the reservations to an inode and keep them there all the way to ioend processing. Unlike delalloc reservations which have that indlen mechanism, there's no way to do that for mapped extents; and indlen blocks are given back during the delalloc -> unwritten transition. The solution was to reserve sufficient blocks for rmap/refcount btree expansion at mount time. This is what the XFS_AG_RESV_* flags provide; any expansion of those two btrees can come from the pre-reserved space. This patch brings that pre-reservation ability to inode-rooted btrees so that the rt rmap and refcount btrees can also save room for future expansion. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: support storing records in the inode core rootDarrick J. Wong
Add the necessary flags and code so that we can support storing leaf records in the inode root block of a btree. This hasn't been necessary before, but the realtime rmapbt will need to be able to do this. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: simplify the xfs_rmap_{alloc,free}_extent calling conventionsDarrick J. Wong
Simplify the calling conventions by allowing callers to pass a fsbno (xfs_fsblock_t) directly into these functions, since we're just going to set it in a struct anyway. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: prepare to reuse the dquot pointer space in struct xfs_inodeDarrick J. Wong
Files participating in the metadata directory tree are not accounted to the quota subsystem. Therefore, the i_[ugp]dquot pointers in struct xfs_inode are never used and should always be NULL. In the next patch we want to add a u64 count of fs blocks reserved for metadata btree expansion, but we don't want every inode in the fs to pay the memory price for this feature. The intent is to union those three pointers with the u64 counter, but for that to work we must guard against all access to the dquot pointers for metadata files. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: prepare rmap btree cursor tracepoints for realtimeDarrick J. Wong
Rework the rmap btree cursor tracepoints in preparation to handle the realtime rmap btree cursor. Mostly this involves renaming the field to "gbno" and extracting the group number from the cursor. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: add some rtgroup inode helpersDarrick J. Wong
Create some simple helpers to reduce the amount of typing whenever we access rtgroup inodes. Conversion was done with this spatch and some minor reformatting: @@ expression rtg; @@ - rtg->rtg_inodes[XFS_RTGI_BITMAP] + rtg_bitmap(rtg) @@ expression rtg; @@ - rtg->rtg_inodes[XFS_RTGI_SUMMARY] + rtg_summary(rtg) and the CLI command: $ spatch --sp-file /tmp/moo.cocci --dir fs/xfs/ --use-gitgrep --in-place Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: hoist the node iroot update code out of xfs_btree_kill_irootDarrick J. Wong
In preparation for allowing records in an inode btree root, hoist the code that copies keyptrs from an existing node child into the root block to a separate function. Remove some unnecessary conditionals and clean up a few function calls in the new function. Note that this change reorders the ->free_block call with respect to the change in bc_nlevels to make it easier to support inode root leaf blocks in the next patch. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: hoist the node iroot update code out of xfs_btree_new_irootDarrick J. Wong
In preparation for allowing records in an inode btree root, hoist the code that copies keyptrs from an existing node root into a child block to a separate function. Note that the new function explicitly computes the keys of the new child block and stores that in the root block; while the bmap btree could rely on leaving the key alone, realtime rmap needs to set the new high key. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: tidy up xfs_bmap_broot_realloc a bitDarrick J. Wong
Hoist out the code that migrates broot pointers during a resize operation to avoid code duplication and streamline the caller. Also use the correct bmbt pointer type for the sizeof operation. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: make xfs_iroot_realloc a bmap btree functionDarrick J. Wong
Move the inode fork btree root reallocation function part of the btree ops because it's now mostly bmbt-specific code. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: make xfs_iroot_realloc take the new numrecs instead of deltasDarrick J. Wong
Change the calling signature of xfs_iroot_realloc to take the ifork and the new number of records in the btree block, not a diff against the current number. This will make the callsites easier to understand. Note that this function is misnamed because it is very specific to the single type of inode-rooted btree supported. This will be addressed in a subsequent patch. Return the new btree root to reduce the amount of code clutter. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: refactor the inode fork memory allocation functionsDarrick J. Wong
Hoist the code that allocates, frees, and reallocates if_broot into a single xfs_iroot_krealloc function. Eventually we're going to push xfs_iroot_realloc into the btree ops structure to handle multiple inode-rooted btrees, but first let's separate out the bits that should stay in xfs_inode_fork.c. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: release the dquot buf outside of qli_lockDarrick J. Wong
Lai Yi reported a lockdep complaint about circular locking: Chain exists of: &lp->qli_lock --> &bch->bc_lock --> &l->lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&l->lock); lock(&bch->bc_lock); lock(&l->lock); lock(&lp->qli_lock); I /think/ the problem here is that xfs_dquot_attach_buf during quotacheck will release the buffer while it's holding the qli_lock. Because this is a cached buffer, xfs_buf_rele_cached takes b_lock before decrementing b_hold. Other threads have taught lockdep that a locking dependency chain is bp->b_lock -> bch->bc_lock -> l(ru)->lock; and that another chain is l(ru)->lock -> lp->qli_lock. Hence we do not want to take b_lock while holding qli_lock. Reported-by: syzbot+3126ab3db03db42e7a31@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> # v6.13-rc3 Fixes: ca378189fdfa89 ("xfs: convert quotacheck to attach dquot buffers") Tested-by: syzbot+3126ab3db03db42e7a31@syzkaller.appspotmail.com Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: tidy up xfs_iroot_reallocDarrick J. Wong
Tidy up this function a bit before we start refactoring the memory handling and move the function to the bmbt code. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: don't over-report free space or inodes in statvfsDarrick J. Wong
Emmanual Florac reports a strange occurrence when project quota limits are enabled, free space is lower than the remaining quota, and someone runs statvfs: # mkfs.xfs -f /dev/sda # mount /dev/sda /mnt -o prjquota # xfs_quota -x -c 'limit -p bhard=2G 55' /mnt # mkdir /mnt/dir # xfs_io -c 'chproj 55' -c 'chattr +P' -c 'stat -vvvv' /mnt/dir # fallocate -l 19g /mnt/a # df /mnt /mnt/dir Filesystem Size Used Avail Use% Mounted on /dev/sda 20G 20G 345M 99% /mnt /dev/sda 2.0G 0 2.0G 0% /mnt I think the bug here is that xfs_fill_statvfs_from_dquot unconditionally assigns to f_bfree without checking that the filesystem has enough free space to fill the remaining project quota. However, this is a longstanding behavior of xfs so it's unclear what to do here. Cc: <stable@vger.kernel.org> # v2.6.18 Fixes: 932f2c323196c2 ("[XFS] statvfs component of directory/project quota support, code originally by Glen.") Reported-by: Emmanuel Florac <eflorac@intellique.com> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: port xfs_ioc_start_commit to multigrain timestampsDarrick J. Wong
Take advantage of the multigrain timestamp APIs to ensure that nobody can sneak in and write things to a file between starting a file update operation and committing the results. This should have been part of the multigrain timestamp merge, but I forgot to fling it at jlayton when he resubmitted the patchset due to developer bandwidth problems. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 4e40eff0b5737c ("fs: add infrastructure for multigrain timestamps") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2024-12-12xfs: return from xfs_symlink_verify early on V4 filesystemsDarrick J. Wong
V4 symlink blocks didn't have headers, so return early if this is a V4 filesystem. Cc: <stable@vger.kernel.org> # v5.1 Fixes: 39708c20ab5133 ("xfs: miscellaneous verifier magic value fixups") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: fix zero byte checking in the superblock scrubberDarrick J. Wong
The logic to check that the region past the end of the superblock is all zeroes is wrong -- we don't want to check only the bytes past the end of the maximally sized ondisk superblock structure as currently defined in xfs_format.h; we want to check the bytes beyond the end of the ondisk as defined by the feature bits. Port the superblock size logic from xfs_repair and then put it to use in xfs_scrub. Cc: <stable@vger.kernel.org> # v4.15 Fixes: 21fb4cb1981ef7 ("xfs: scrub the secondary superblocks") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: check pre-metadir fields correctlyDarrick J. Wong
The checks that were added to the superblock scrubber for metadata directories aren't quite right -- the old inode pointers are now defined to be zeroes until someone else reuses them. Also consolidate the new metadir field checks to one place; they were inexplicably scattered around. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 28d756d4d562dc ("xfs: update sb field checks when metadir is turned on") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: don't crash on corrupt /quotas direntDarrick J. Wong
If the /quotas dirent points to an inode but the inode isn't loadable (and hence mkdir returns -EEXIST), don't crash, just bail out. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: e80fbe1ad8eff7 ("xfs: use metadir for quota inodes") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: don't move nondir/nonreg temporary repair files to the metadir namespaceDarrick J. Wong
Only directories or regular files are allowed in the metadata directory tree. Don't move the repair tempfile to the metadir namespace if this is not true; this will cause the inode verifiers to trip. xrep_tempfile_adjust_directory_tree opportunistically moves sc->tempip from the regular directory tree to the metadata directory tree if sc->ip is part of the metadata directory tree. However, the scrub setup functions grab sc->ip and create sc->tempip before we actually get around to checking if the file mode is the right type for the scrubber. IOWs, you can invoke the symlink scrubber with the file handle of a subdirectory in the metadir. xrep_setup_symlink will create a temporary symlink file, xrep_tempfile_adjust_directory_tree will foolishly try to set the METADATA flag on the temp symlink, which trips the inode verifier in the inode item precommit, which shuts down the filesystem when expensive checks are turned on. If they're /not/ turned on, then xchk_symlink will return ENOENT when it sees that it's been passed a symlink, but the invalid inode could still get flushed to disk. We don't want that. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 9dc31acb01a1c7 ("xfs: move repair temporary files to the metadata directory tree") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: fix sb_spino_align checks for large fsblock sizesDarrick J. Wong
For a sparse inodes filesystem, mkfs.xfs computes the values of sb_spino_align and sb_inoalignmt with the following code: int cluster_size = XFS_INODE_BIG_CLUSTER_SIZE; if (cfg->sb_feat.crcs_enabled) cluster_size *= cfg->inodesize / XFS_DINODE_MIN_SIZE; sbp->sb_spino_align = cluster_size >> cfg->blocklog; sbp->sb_inoalignmt = XFS_INODES_PER_CHUNK * cfg->inodesize >> cfg->blocklog; On a V5 filesystem with 64k fsblocks and 512 byte inodes, this results in cluster_size = 8192 * (512 / 256) = 16384. As a result, sb_spino_align and sb_inoalignmt are both set to zero. Unfortunately, this trips the new sb_spino_align check that was just added to xfs_validate_sb_common, and the mkfs fails: # mkfs.xfs -f -b size=64k, /dev/sda meta-data=/dev/sda isize=512 agcount=4, agsize=81136 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=1 = reflink=1 bigtime=1 inobtcount=1 nrext64=1 = exchange=0 metadir=0 data = bsize=65536 blocks=324544, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=65536 ascii-ci=0, ftype=1, parent=0 log =internal log bsize=65536 blocks=5006, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=65536 blocks=0, rtextents=0 = rgcount=0 rgsize=0 extents Discarding blocks...Sparse inode alignment (0) is invalid. Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200 libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1 mkfs.xfs: Releasing dirty buffer to free list! found dirty buffer (bulk) on free list! Sparse inode alignment (0) is invalid. Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200 libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1 mkfs.xfs: writing AG headers failed, err=22 Prior to commit 59e43f5479cce1 this all worked fine, even if "sparse" inodes are somewhat meaningless when everything fits in a single fsblock. Adjust the checks to handle existing filesystems. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 59e43f5479cce1 ("xfs: sb_spino_align is not verified") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: convert quotacheck to attach dquot buffersDarrick J. Wong
Now that we've converted the dquot logging machinery to attach the dquot buffer to the li_buf pointer so that the AIL dqflush doesn't have to allocate or read buffers in a reclaim path, do the same for the quotacheck code so that the reclaim shrinker dqflush call doesn't have to do that either. Cc: <stable@vger.kernel.org> # v6.12 Fixes: 903edea6c53f09 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: attach dquot buffer to dquot log item bufferDarrick J. Wong
Ever since 6.12-rc1, I've observed a pile of warnings from the kernel when running fstests with quotas enabled: WARNING: CPU: 1 PID: 458580 at mm/page_alloc.c:4221 __alloc_pages_noprof+0xc9c/0xf18 CPU: 1 UID: 0 PID: 458580 Comm: xfsaild/sda3 Tainted: G W 6.12.0-rc6-djwa #rc6 6ee3e0e531f6457e2d26aa008a3b65ff184b377c <snip> Call trace: __alloc_pages_noprof+0xc9c/0xf18 alloc_pages_mpol_noprof+0x94/0x240 alloc_pages_noprof+0x68/0xf8 new_slab+0x3e0/0x568 ___slab_alloc+0x5a0/0xb88 __slab_alloc.constprop.0+0x7c/0xf8 __kmalloc_noprof+0x404/0x4d0 xfs_buf_get_map+0x594/0xde0 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_buf_read_map+0x64/0x2e0 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_trans_read_buf_map+0x1dc/0x518 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_qm_dqflush+0xac/0x468 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_qm_dquot_logitem_push+0xe4/0x148 [xfs 384cb02810558b4c490343c164e9407332118f88] xfsaild+0x3f4/0xde8 [xfs 384cb02810558b4c490343c164e9407332118f88] kthread+0x110/0x128 ret_from_fork+0x10/0x20 ---[ end trace 0000000000000000 ]--- This corresponds to the line: WARN_ON_ONCE(current->flags & PF_MEMALLOC); within the NOFAIL checks. What's happening here is that the XFS AIL is trying to write a disk quota update back into the filesystem, but for that it needs to read the ondisk buffer for the dquot. The buffer is not in memory anymore, probably because it was evicted. Regardless, the buffer cache tries to allocate a new buffer, but those allocations are NOFAIL. The AIL thread has marked itself PF_MEMALLOC (aka noreclaim) since commit 43ff2122e6492b ("xfs: on-stack delayed write buffer lists") presumably because reclaim can push on XFS to push on the AIL. An easy way to fix this probably would have been to drop the NOFAIL flag from the xfs_buf allocation and open code a retry loop, but then there's still the problem that for bs>ps filesystems, the buffer itself could require up to 64k worth of pages. Inode items had similar behavior (multi-page cluster buffers that we don't want to allocate in the AIL) which we solved by making transaction precommit attach the inode cluster buffers to the dirty log item. Let's solve the dquot problem in the same way. So: Make a real precommit handler to read the dquot buffer and attach it to the log item; pass it to dqflush in the push method; and have the iodone function detach the buffer once we've flushed everything. Add a state flag to the log item to track when a thread has entered the precommit -> push mechanism to skip the detaching if it turns out that the dquot is very busy, as we don't hold the dquot lock between log item commit and AIL push). Reading and attaching the dquot buffer in the precommit hook is inspired by the work done for inode cluster buffers some time ago. Cc: <stable@vger.kernel.org> # v6.12 Fixes: 903edea6c53f09 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: clean up log item accesses in xfs_qm_dqflush{,_done}Darrick J. Wong
Clean up these functions a little bit before we move on to the real modifications, and make the variable naming consistent for dquot log items. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: separate dquot buffer reads from xfs_dqflushDarrick J. Wong
The first step towards holding the dquot buffer in the li_buf instead of reading it in the AIL is to separate the part that reads the buffer from the actual flush code. There should be no functional changes. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: don't lose solo dquot update transactionsDarrick J. Wong
Quota counter updates are tracked via incore objects which hang off the xfs_trans object. These changes are then turned into dirty log items in xfs_trans_apply_dquot_deltas just prior to commiting the log items to the CIL. However, updating the incore deltas do not cause XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure quota counter update will be silently discarded if there are no other dirty log items attached to the transaction. This is currently not the case anywhere in the filesystem because quota updates always dirty at least one other metadata item, but a subsequent bug fix will add dquot log item precommits, so we actually need a dirty dquot log item prior to xfs_trans_run_precommits. Also let's not leave a logic bomb. Cc: <stable@vger.kernel.org> # v2.6.35 Fixes: 0924378a689ccb ("xfs: split out iclog writing from xfs_trans_commit()") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: don't lose solo superblock counter update transactionsDarrick J. Wong
Superblock counter updates are tracked via per-transaction counters in the xfs_trans object. These changes are then turned into dirty log items in xfs_trans_apply_sb_deltas just prior to commiting the log items to the CIL. However, updating the per-transaction counter deltas do not cause XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure sb counter update will be silently discarded if there are no other dirty log items attached to the transaction. This is currently not the case anywhere in the filesystem because sb counter updates always dirty at least one other metadata item, but let's not leave a logic bomb. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: avoid nested calls to __xfs_trans_commitDarrick J. Wong
Currently, __xfs_trans_commit calls xfs_defer_finish_noroll, which calls __xfs_trans_commit again on the same transaction. In other words, there's a nested function call (albeit with slightly different arguments) that has caused minor amounts of confusion in the past. There's no reason to keep this around, since there's only one place where we actually want the xfs_defer_finish_noroll, and that is in the top level xfs_trans_commit call. This also reduces stack usage a little bit. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: only run precommits once per transaction objectDarrick J. Wong
Committing a transaction tx0 with a defer ops chain of (A, B, C) creates a chain of transactions that looks like this: tx0 -> txA -> txB -> txC Prior to commit cb042117488dbf, __xfs_trans_commit would run precommits on tx0, then call xfs_defer_finish_noroll to convert A-C to tx[A-C]. Unfortunately, after the finish_noroll loop we forgot to run precommits on txC. That was fixed by adding the second precommit call. Unfortunately, none of us remembered that xfs_defer_finish_noroll calls __xfs_trans_commit a second time to commit tx0 before finishing work A in txA and committing that. In other words, we run precommits twice on tx0: xfs_trans_commit(tx0) __xfs_trans_commit(tx0, false) xfs_trans_run_precommits(tx0) xfs_defer_finish_noroll(tx0) xfs_trans_roll(tx0) txA = xfs_trans_dup(tx0) __xfs_trans_commit(tx0, true) xfs_trans_run_precommits(tx0) This currently isn't an issue because the inode item precommit is idempotent; the iunlink item precommit deletes itself so it can't be called again; and the buffer/dquot item precommits only check the incore objects for corruption. However, it doesn't make sense to run precommits twice. Fix this situation by only running precommits after finish_noroll. Cc: <stable@vger.kernel.org> # v6.4 Fixes: cb042117488dbf ("xfs: defered work could create precommits") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: unlock inodes when erroring out of xfs_trans_alloc_dirDarrick J. Wong
Debugging a filesystem patch with generic/475 caused the system to hang after observing the following sequences in dmesg: XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x491520 len 32 error 5 XFS (dm-0): metadata I/O error in "xfs_btree_read_buf_block+0xba/0x160 [xfs]" at daddr 0x3445608 len 8 error 5 XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x138e1c0 len 32 error 5 XFS (dm-0): log I/O error -5 XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x1ea/0x4b0 [xfs] (fs/xfs/xfs_trans_buf.c:311). Shutting down filesystem. XFS (dm-0): Please unmount the filesystem and rectify the problem(s) XFS (dm-0): Internal error dqp->q_ino.reserved < dqp->q_ino.count at line 869 of file fs/xfs/xfs_trans_dquot.c. Caller xfs_trans_dqresv+0x236/0x440 [xfs] XFS (dm-0): Corruption detected. Unmount and run xfs_repair XFS (dm-0): Unmounting Filesystem be6bcbcc-9921-4deb-8d16-7cc94e335fa7 The system is stuck in unmount trying to lock a couple of inodes so that they can be purged. The dquot corruption notice above is a clue to what happened -- a link() call tried to set up a transaction to link a child into a directory. Quota reservation for the transaction failed after IO errors shut down the filesystem, but then we forgot to unlock the inodes on our way out. Fix that. Cc: <stable@vger.kernel.org> # v6.10 Fixes: bd5562111d5839 ("xfs: Hold inode locks in xfs_trans_alloc_dir") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: fix scrub tracepoints when inode-rooted btrees are involvedDarrick J. Wong
Fix a minor mistakes in the scrub tracepoints that can manifest when inode-rooted btrees are enabled. The existing code worked fine for bmap btrees, but we should tighten the code up to be less sloppy. Cc: <stable@vger.kernel.org> # v5.7 Fixes: 92219c292af8dd ("xfs: convert btree cursor inode-private member names") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: update btree keys correctly when _insrec splits an inode root blockDarrick J. Wong
In commit 2c813ad66a72, I partially fixed a bug wherein xfs_btree_insrec would erroneously try to update the parent's key for a block that had been split if we decided to insert the new record into the new block. The solution was to detect this situation and update the in-core key value that we pass up to the caller so that the caller will (eventually) add the new block to the parent level of the tree with the correct key. However, I missed a subtlety about the way inode-rooted btrees work. If the full block was a maximally sized inode root block, we'll solve that fullness by moving the root block's records to a new block, resizing the root block, and updating the root to point to the new block. We don't pass a pointer to the new block to the caller because that work has already been done. The new record will /always/ land in the new block, so in this case we need to use xfs_btree_update_keys to update the keys. This bug can theoretically manifest itself in the very rare case that we split a bmbt root block and the new record lands in the very first slot of the new block, though I've never managed to trigger it in practice. However, it is very easy to reproduce by running generic/522 with the realtime rmapbt patchset if rtinherit=1. Cc: <stable@vger.kernel.org> # v4.8 Fixes: 2c813ad66a7218 ("xfs: support btrees with overlapping intervals for keys") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: fix error bailout in xfs_rtginode_createDarrick J. Wong
smatch reported that we screwed up the error cleanup in this function. Fix it. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: ae897e0bed0f54 ("xfs: support creating per-RTG files in growfs") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: fix null bno_hint handling in xfs_rtallocate_rtgDarrick J. Wong
xfs_bmap_rtalloc initializes the bno_hint variable to NULLRTBLOCK (aka NULLFSBLOCK). If the allocation request is for a file range that's adjacent to an existing mapping, it will then change bno_hint to the blkno hint in the bmalloca structure. In other words, bno_hint is either a rt block number, or it's all 1s. Unfortunately, commit ec12f97f1b8a8f didn't take the NULLRTBLOCK state into account, which means that it tries to translate that into a realtime extent number. We then end up with an obnoxiously high rtx number and pointlessly feed that to the near allocator. This often fails and falls back to the by-size allocator. Seeing as we had no locality hint anyway, this is a waste of time. Fix the code to detect a lack of bno_hint correctly. This was detected by running xfs/009 with metadir enabled and a 28k rt extent size. Cc: <stable@vger.kernel.org> # v6.12 Fixes: ec12f97f1b8a8f ("xfs: make the rtalloc start hint a xfs_rtblock_t") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: mark metadir repair tempfiles with IRECOVERYDarrick J. Wong
Once in a long while, xfs/566 and xfs/801 report directory corruption in one of the metadata subdirectories while it's forcibly rebuilding all filesystem metadata. I observed the following sequence of events: 1. Initiate a repair of the parent pointers for the /quota/user file. This is the secret file containing user quota data. 2. The pptr repair thread creates a temporary file and begins staging parent pointers in the ondisk metadata in preparation for an exchange-range to commit the new pptr data. 3. At the same time, initiate a repair of the /quota directory itself. 4. The dir repair thread finds the temporary file from (2), scans it for parent pointers, and stages a dirent in its own temporary dir in preparation to commit the fixed directory. 5. The parent pointer repair completes and frees the temporary file. 6. The dir repair commits the new directory and scans it again. It finds the dirent that points to the old temporary file in (2) and marks the directory corrupt. Oops! Repair code must never scan the temporary files that other repair functions create to stage new metadata. They're not supposed to do that, but the predicate function xrep_is_tempfile is incorrect because it assumes that any XFS_DIFLAG2_METADATA file cannot ever be a temporary file, but xrep_tempfile_adjust_directory_tree creates exactly that. Fix this by setting the IRECOVERY flag on temporary metadata directory inodes and using that to correct the predicate. Repair code is supposed to erase all the data in temporary files before releasing them, so it's ok if a thread scans the temporary file after we drop IRECOVERY. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: bb6cdd5529ff67 ("xfs: hide metadata inodes from everyone because they are special") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>