summaryrefslogtreecommitdiff
path: root/fs/xfs/libxfs/xfs_ag.c
AgeCommit message (Collapse)Author
2024-03-07xfs: shrink failure needs to hold AGI bufferDave Chinner
Chandan reported a AGI/AGF lock order hang on xfs/168 during recent testing. The cause of the problem was the task running xfs_growfs to shrink the filesystem. A failure occurred trying to remove the free space from the btrees that the shrink would make disappear, and that meant it ran the error handling for a partial failure. This error path involves restoring the per-ag block reservations, and that requires calculating the amount of space needed to be reserved for the free inode btree. The growfs operation hung here: [18679.536829] down+0x71/0xa0 [18679.537657] xfs_buf_lock+0xa4/0x290 [xfs] [18679.538731] xfs_buf_find_lock+0xf7/0x4d0 [xfs] [18679.539920] xfs_buf_lookup.constprop.0+0x289/0x500 [xfs] [18679.542628] xfs_buf_get_map+0x2b3/0xe40 [xfs] [18679.547076] xfs_buf_read_map+0xbb/0x900 [xfs] [18679.562616] xfs_trans_read_buf_map+0x449/0xb10 [xfs] [18679.569778] xfs_read_agi+0x1cd/0x500 [xfs] [18679.573126] xfs_ialloc_read_agi+0xc2/0x5b0 [xfs] [18679.578708] xfs_finobt_calc_reserves+0xe7/0x4d0 [xfs] [18679.582480] xfs_ag_resv_init+0x2c5/0x490 [xfs] [18679.586023] xfs_ag_shrink_space+0x736/0xd30 [xfs] [18679.590730] xfs_growfs_data_private.isra.0+0x55e/0x990 [xfs] [18679.599764] xfs_growfs_data+0x2f1/0x410 [xfs] [18679.602212] xfs_file_ioctl+0xd1e/0x1370 [xfs] trying to get the AGI lock. The AGI lock was held by a fstress task trying to do an inode allocation, and it was waiting on the AGF lock to allocate a new inode chunk on disk. Hence deadlock. The fix for this is for the growfs code to hold the AGI over the transaction roll it does in the error path. It already holds the AGF locked across this, and that is what causes the lock order inversion in the xfs_ag_resv_init() call. Reported-by: Chandan Babu R <chandanbabu@kernel.org> Fixes: 46141dc891f7 ("xfs: introduce xfs_ag_shrink_space()") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-22xfs: hook live rmap operations during a repair operationDarrick J. Wong
Hook the regular rmap code when an rmapbt repair operation is running so that we can unlock the AGF buffer to scan the filesystem and keep the in-memory btree up to date during the scan. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: teach buftargs to maintain their own buffer hashtableDarrick J. Wong
Currently, cached buffers are indexed by per-AG hashtables. This works great for the data device, but won't work for in-memory btrees. To handle that use case, buftargs will need to be able to index buffers independently of other data structures. We accomplish this by hoisting the rhashtable and its lock into a separate xfs_buf_cache structure, make the buftarg point to the _buf_cache structure, and rework various functions to use it. This will enable the in-memory buftarg to come up with its own _buf_cache. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: split the agf_roots and agf_levels arraysChristoph Hellwig
Using arrays of largely unrelated fields that use the btree number as index is not very robust. Split the arrays into three separate fields instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: rename btree block/buffer init functionsDarrick J. Wong
Rename xfs_btree_init_block_int to xfs_btree_init_block, and xfs_btree_init_block to xfs_btree_init_buf so that the name suggests the type that caller are supposed to pass in. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: initialize btree blocks using btree_ops structureDarrick J. Wong
Notice now that the btree ops structure encodes btree geometry flags and the magic number through the buffer ops. Refactor the btree block initialization functions to use the btree ops so that we no longer have to open code all that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: report XFS_IS_CORRUPT errors to the health systemDarrick J. Wong
Whenever we encounter XFS_IS_CORRUPT failures, we should report that to the health monitoring system for later reporting. I started with this semantic patch and massaged everything until it built: @@ expression mp, test; @@ - if (XFS_IS_CORRUPT(mp, test)) return -EFSCORRUPTED; + if (XFS_IS_CORRUPT(mp, test)) { xfs_btree_mark_sick(cur); return -EFSCORRUPTED; } @@ expression mp, test; identifier label, error; @@ - if (XFS_IS_CORRUPT(mp, test)) { error = -EFSCORRUPTED; goto label; } + if (XFS_IS_CORRUPT(mp, test)) { xfs_btree_mark_sick(cur); error = -EFSCORRUPTED; goto label; } Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: report fs corruption errors to the health tracking systemDarrick J. Wong
Whenever we encounter corrupt fs metadata, we should report that to the health monitoring system for later reporting. A convenient program for identifying places to insert xfs_*_mark_sick calls is as follows: #!/bin/bash # Detect missing calls to xfs_*_mark_sick filter=cat tty -s && filter=less git grep -B3 EFSCORRUPTED fs/xfs/*.[ch] fs/xfs/libxfs/*.[ch] fs/xfs/scrub/*.[ch] | awk ' BEGIN { ignore = 0; lineno = 0; delete lines; } { if ($0 == "--") { if (!ignore) { for (i = 0; i < lineno; i++) { print(lines[i]); } printf("--\n"); } delete lines; lineno = 0; ignore = 0; } else if ($0 ~ /mark_sick/) { ignore = 1; } else if ($0 ~ /if .fa/) { ignore = 1; } else if ($0 ~ /failaddr/) { ignore = 1; } else if ($0 ~ /_verifier_error/) { ignore = 1; } else if ($0 ~ /^ \* .*EFSCORRUPTED/) { ignore = 1; } else if ($0 ~ /== -EFSCORRUPTED/) { ignore = 1; } else if ($0 ~ /!= -EFSCORRUPTED/) { ignore = 1; } else { lines[lineno++] = $0; } } ' | $filter Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-13xfs: use __GFP_NOLOCKDEP instead of GFP_NOFSDave Chinner
In the past we've had problems with lockdep false positives stemming from inode locking occurring in memory reclaim contexts (e.g. from superblock shrinkers). Lockdep doesn't know that inodes access from above memory reclaim cannot be accessed from below memory reclaim (and vice versa) but there has never been a good solution to solving this problem with lockdep annotations. This situation isn't unique to inode locks - buffers are also locked above and below memory reclaim, and we have to maintain lock ordering for them - and against inodes - appropriately. IOWs, the same code paths and locks are taken both above and below memory reclaim and so we always need to make sure the lock orders are consistent. We are spared the lockdep problems this might cause by the fact that semaphores and bit locks aren't covered by lockdep. In general, this sort of lockdep false positive detection is cause by code that runs GFP_KERNEL memory allocation with an actively referenced inode locked. When it is run from a transaction, memory allocation is automatically GFP_NOFS, so we don't have reclaim recursion issues. So in the places where we do memory allocation with inodes locked outside of a transaction, we have explicitly set them to use GFP_NOFS allocations to prevent lockdep false positives from being reported if the allocation dips into direct memory reclaim. More recently, __GFP_NOLOCKDEP was added to the memory allocation flags to tell lockdep not to track that particular allocation for the purposes of reclaim recursion detection. This is a much better way of preventing false positives - it allows us to use GFP_KERNEL context outside of transactions, and allows direct memory reclaim to proceed normally without throwing out false positive deadlock warnings. The obvious places that lock inodes and do memory allocation are the lookup paths and inode extent list initialisation. These occur in non-transactional GFP_KERNEL contexts, and so can run direct reclaim and lock inodes. This patch makes a first path through all the explicit GFP_NOFS allocations in XFS and converts the obvious ones to GFP_KERNEL | __GFP_NOLOCKDEP as a first step towards removing explicit GFP_NOFS allocations from the XFS code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: convert remaining kmem_free() to kfree()Dave Chinner
The remaining callers of kmem_free() are freeing heap memory, so we can convert them directly to kfree() and get rid of kmem_free() altogether. This conversion was done with: $ for f in `git grep -l kmem_free fs/xfs`; do > sed -i s/kmem_free/kfree/ $f > done $ Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: convert kmem_zalloc() to kzalloc()Dave Chinner
There's no reason to keep the kmem_zalloc() around anymore, it's just a thin wrapper around kmalloc(), so lets get rid of it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-22xfs: fix perag leak when growfs failsLong Li
During growfs, if new ag in memory has been initialized, however sb_agcount has not been updated, if an error occurs at this time it will cause perag leaks as follows, these new AGs will not been freed during umount , because of these new AGs are not visible(that is included in mp->m_sb.sb_agcount). unreferenced object 0xffff88810be40200 (size 512): comm "xfs_growfs", pid 857, jiffies 4294909093 hex dump (first 32 bytes): 00 c0 c1 05 81 88 ff ff 04 00 00 00 00 00 00 00 ................ 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc 381741e2): [<ffffffff8191aef6>] __kmalloc+0x386/0x4f0 [<ffffffff82553e65>] kmem_alloc+0xb5/0x2f0 [<ffffffff8238dac5>] xfs_initialize_perag+0xc5/0x810 [<ffffffff824f679c>] xfs_growfs_data+0x9bc/0xbc0 [<ffffffff8250b90e>] xfs_file_ioctl+0x5fe/0x14d0 [<ffffffff81aa5194>] __x64_sys_ioctl+0x144/0x1c0 [<ffffffff83c3d81f>] do_syscall_64+0x3f/0xe0 [<ffffffff83e00087>] entry_SYSCALL_64_after_hwframe+0x62/0x6a unreferenced object 0xffff88810be40800 (size 512): comm "xfs_growfs", pid 857, jiffies 4294909093 hex dump (first 32 bytes): 20 00 00 00 00 00 00 00 57 ef be dc 00 00 00 00 .......W....... 10 08 e4 0b 81 88 ff ff 10 08 e4 0b 81 88 ff ff ................ backtrace (crc bde50e2d): [<ffffffff8191b43a>] __kmalloc_node+0x3da/0x540 [<ffffffff81814489>] kvmalloc_node+0x99/0x160 [<ffffffff8286acff>] bucket_table_alloc.isra.0+0x5f/0x400 [<ffffffff8286bdc5>] rhashtable_init+0x405/0x760 [<ffffffff8238dda3>] xfs_initialize_perag+0x3a3/0x810 [<ffffffff824f679c>] xfs_growfs_data+0x9bc/0xbc0 [<ffffffff8250b90e>] xfs_file_ioctl+0x5fe/0x14d0 [<ffffffff81aa5194>] __x64_sys_ioctl+0x144/0x1c0 [<ffffffff83c3d81f>] do_syscall_64+0x3f/0xe0 [<ffffffff83e00087>] entry_SYSCALL_64_after_hwframe+0x62/0x6a Factor out xfs_free_unused_perag_range() from xfs_initialize_perag(), used for freeing unused perag within a specified range in error handling, included in the error path of the growfs failure. Fixes: 1c1c6ebcf528 ("xfs: Replace per-ag array with a radix tree") Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-22xfs: add lock protection when remove perag from radix treeLong Li
Take mp->m_perag_lock for deletions from the perag radix tree in xfs_initialize_perag to prevent racing with tagging operations. Lookups are fine - they are RCU protected so already deal with the tree changing shape underneath the lookup - but tagging operations require the tree to be stable while the tags are propagated back up to the root. Right now there's nothing stopping radix tree tagging from operating while a growfs operation is progress and adding/removing new entries into the radix tree. Hence we can have traversals that require a stable tree occurring at the same time we are removing unused entries from the radix tree which causes the shape of the tree to change. Likely this hasn't caused a problem in the past because we are only doing append addition and removal so the active AG part of the tree is not changing shape, but that doesn't mean it is safe. Just making the radix tree modifications serialise against each other is obviously correct. Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-06xfs: remove __xfs_free_extent_laterDarrick J. Wong
xfs_free_extent_later is a trivial helper, so remove it to reduce the amount of thinking required to understand the deferred freeing interface. This will make it easier to introduce automatic reaping of speculative allocations in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-11xfs: adjust the incore perag block_count when shrinkingDarrick J. Wong
If we reduce the number of blocks in an AG, we must update the incore geometry values as well. Fixes: 0800169e3e2c9 ("xfs: Pre-calculate per-AG agbno geometry") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-06-29xfs: use deferred frees for btree block freeingDave Chinner
Btrees that aren't freespace management trees use the normal extent allocation and freeing routines for their blocks. Hence when a btree block is freed, a direct call to xfs_free_extent() is made and the extent is immediately freed. This puts the entire free space management btrees under this path, so we are stacking btrees on btrees in the call stack. The inobt, finobt and refcount btrees all do this. However, the bmap btree does not do this - it calls xfs_free_extent_later() to defer the extent free operation via an XEFI and hence it gets processed in deferred operation processing during the commit of the primary transaction (i.e. via intent chaining). We need to change xfs_free_extent() to behave in a non-blocking manner so that we can avoid deadlocks with busy extents near ENOSPC in transactions that free multiple extents. Inserting or removing a record from a btree can cause a multi-level tree merge operation and that will free multiple blocks from the btree in a single transaction. i.e. we can call xfs_free_extent() multiple times, and hence the btree manipulation transaction is vulnerable to this busy extent deadlock vector. To fix this, convert all the remaining callers of xfs_free_extent() to use xfs_free_extent_later() to queue XEFIs and hence defer processing of the extent frees to a context that can be safely restarted if a deadlock condition is detected. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2023-06-05xfs: validate block number being freed before adding to xefiDave Chinner
Bad things happen in defered extent freeing operations if it is passed a bad block number in the xefi. This can come from a bogus agno/agbno pair from deferred agfl freeing, or just a bad fsbno being passed to __xfs_free_extent_later(). Either way, it's very difficult to diagnose where a null perag oops in EFI creation is coming from when the operation that queued the xefi has already been completed and there's no longer any trace of it around.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02xfs: set bnobt/cntbt numrecs correctly when formatting new AGsDarrick J. Wong
Through generic/300, I discovered that mkfs.xfs creates corrupt filesystems when given these parameters: # mkfs.xfs -d size=512M /dev/sda -f -d su=128k,sw=4 --unsupported Filesystems formatted with --unsupported are not supported!! meta-data=/dev/sda isize=512 agcount=8, agsize=16352 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=1 = reflink=1 bigtime=1 inobtcount=1 nrext64=1 data = bsize=4096 blocks=130816, imaxpct=25 = sunit=32 swidth=128 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=8192, version=2 = sectsz=512 sunit=32 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 = rgcount=0 rgsize=0 blks Discarding blocks...Done. # xfs_repair -n /dev/sda Phase 1 - find and verify superblock... - reporting progress in intervals of 15 minutes Phase 2 - using internal log - zero log... - 16:30:50: zeroing log - 16320 of 16320 blocks done - scan filesystem freespace and inode maps... agf_freeblks 25, counted 0 in ag 4 sb_fdblocks 8823, counted 8798 The root cause of this problem is the numrecs handling in xfs_freesp_init_recs, which is used to initialize a new AG. Prior to calling the function, we set up the new bnobt block with numrecs == 1 and rely on _freesp_init_recs to format that new record. If the last record created has a blockcount of zero, then it sets numrecs = 0. That last bit isn't correct if the AG contains the log, the start of the log is not immediately after the initial blocks due to stripe alignment, and the end of the log is perfectly aligned with the end of the AG. For this case, we actually formatted a single bnobt record to handle the free space before the start of the (stripe aligned) log, and incremented arec to try to format a second record. That second record turned out to be unnecessary, so what we really want is to leave numrecs at 1. The numrecs handling itself is overly complicated because a different function sets numrecs == 1. Change the bnobt creation code to start with numrecs set to zero and only increment it after successfully formatting a free space extent into the btree block. Fixes: f327a00745ff ("xfs: account for log space when formatting new AGs") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-11xfs: allow queued AG intents to drain before scrubbingDarrick J. Wong
When a writer thread executes a chain of log intent items, the AG header buffer locks will cycle during a transaction roll to get from one intent item to the next in a chain. Although scrub takes all AG header buffer locks, this isn't sufficient to guard against scrub checking an AG while that writer thread is in the middle of finishing a chain because there's no higher level locking primitive guarding allocation groups. When there's a collision, cross-referencing between data structures (e.g. rmapbt and refcountbt) yields false corruption events; if repair is running, this results in incorrect repairs, which is catastrophic. Fix this by adding to the perag structure the count of active intents and make scrub wait until it has both AG header buffer locks and the intent counter reaches zero. One quirk of the drain code is that deferred bmap updates also bump and drop the intent counter. A fundamental decision made during the design phase of the reverse mapping feature is that updates to the rmapbt records are always made by the same code that updates the primary metadata. In other words, callers of bmapi functions expect that the bmapi functions will queue deferred rmap updates. Some parts of the reflink code queue deferred refcount (CUI) and bmap (BUI) updates in the same head transaction, but the deferred work manager completely finishes the CUI before the BUI work is started. As a result, the CUI drops the intent count long before the deferred rmap (RUI) update even has a chance to bump the intent count. The only way to keep the intent count elevated between the CUI and RUI is for the BUI to bump the counter until the RUI has been created. A second quirk of the intent drain code is that deferred work items must increment the intent counter as soon as the work item is added to the transaction. When a BUI completes and queues an RUI, the RUI must increment the counter before the BUI decrements it. The only way to accomplish this is to require that the counter be bumped as soon as the deferred work item is created in memory. In the next patches we'll improve on this facility, but this patch provides the basic functionality. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11xfs: create traced helper to get extra perag referencesDarrick J. Wong
There are a few places in the XFS codebase where a caller has either an active or a passive reference to a perag structure and wants to give a passive reference to some other piece of code. Btree cursor creation and inode walks are good examples of this. Replace the open-coded logic with a helper to do this. The new function adds a few safeguards -- it checks that there's at least one reference to the perag structure passed in, and it records the refcount bump in the ftrace information. This makes it much easier to debug perag refcounting problems. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11xfs: pass per-ag references to xfs_free_extentDarrick J. Wong
Pass a reference to the per-AG structure to xfs_free_extent. Most callers already have one, so we can eliminate unnecessary lookups. The one exception to this is the EFI code, which the next patch will fix. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-02-13xfs: introduce xfs_alloc_vextent_exact_bno()Dave Chinner
Two of the callers to xfs_alloc_vextent_this_ag() actually want exact block number allocation, not anywhere-in-ag allocation. Split this out from _this_ag() as a first class citizen so no external extent allocation code needs to care about args->type anymore. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use xfs_alloc_vextent_this_ag() where appropriateDave Chinner
Change obvious callers of single AG allocation to use xfs_alloc_vextent_this_ag(). Drive the per-ag grabbing out to the callers, too, so that callers with active references don't need to do new lookups just for an allocation in a context that already has a perag reference. The only remaining caller that does single AG allocation through xfs_alloc_vextent() is xfs_bmap_btalloc() with XFS_ALLOCTYPE_NEAR_BNO. That is going to need more untangling before it can be converted cleanly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use active perag references for inode allocationDave Chinner
Convert the inode allocation routines to use active perag references or references held by callers rather than grab their own. Also drive the perag further inwards to replace xfs_mounts when doing operations on a specific AG. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: rework the perag trace points to be perag centricDave Chinner
So that they all output the same information in the traces to make debugging refcount issues easier. This means that all the lookup/drop functions no longer need to use the full memory barrier atomic operations (atomic*_return()) so will have less overhead when tracing is off. The set/clear tag tracepoints no longer abuse the reference count to pass the tag - the tag being cleared is obvious from the _RET_IP_ that is recorded in the trace point. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: active perag reference countingDave Chinner
We need to be able to dynamically remove instantiated AGs from memory safely, either for shrinking the filesystem or paging AG state in and out of memory (e.g. supporting millions of AGs). This means we need to be able to safely exclude operations from accessing perags while dynamic removal is in progress. To do this, introduce the concept of active and passive references. Active references are required for high level operations that make use of an AG for a given operation (e.g. allocation) and pin the perag in memory for the duration of the operation that is operating on the perag (e.g. transaction scope). This means we can fail to get an active reference to an AG, hence callers of the new active reference API must be able to handle lookup failure gracefully. Passive references are used in low level code, where we might need to access the perag structure for the purposes of completing high level operations. For example, buffers need to use passive references because: - we need to be able to do metadata IO during operations like grow and shrink transactions where high level active references to the AG have already been blocked - buffers need to pin the perag until they are reclaimed from memory, something that high level code has no direct control over. - unused cached buffers should not prevent a shrink from being started. Hence we have active references that will form exclusion barriers for operations to be performed on an AG, and passive references that will prevent reclaim of the perag until all objects with passive references have been reclaimed themselves. This patch introduce xfs_perag_grab()/xfs_perag_rele() as the API for active AG reference functionality. We also need to convert the for_each_perag*() iterators to use active references, which will start the process of converting high level code over to using active references. Conversion of non-iterator based code to active references will be done in followup patches. Note that the implementation using reference counting is really just a development vehicle for the API to ensure we don't have any leaks in the callers. Once we need to remove perag structures from memory dyanmically, we will need a much more robust per-ag state transition mechanism for preventing new references from being taken while we wait for existing references to drain before removal from memory can occur.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-14xfs: double link the unlinked inode listDave Chinner
Now we have forwards traversal via the incore inode in place, we now need to add back pointers to the incore inode to entirely replace the back reference cache. We use the same lookup semantics and constraints as for the forwards pointer lookups during unlinks, and so we can look up any inode in the unlinked list directly and update the list pointers, forwards or backwards, at any time. The only wrinkle in converting the unlinked list manipulations to use in-core previous pointers is that log recovery doesn't have the incore inode state built up so it can't just read in an inode and release it to finish off the unlink. Hence we need to modify the traversal in recovery to read one inode ahead before we release the inode at the head of the list. This populates the next->prev relationship sufficient to be able to replay the unlinked list and hence greatly simplify the runtime code. This recovery algorithm also requires that we actually remove inodes from the unlinked list one at a time as background inode inactivation will result in unlinked list removal racing with the building of the in-memory unlinked list state. We could serialise this by holding the AGI buffer lock when constructing the in memory state, but all that does is lockstep background processing with list building. It is much simpler to flush the inodegc immediately after releasing the inode so that it is unlinked immediately and there is no races present at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-07-07xfs: make is_log_ag() a first class helperDave Chinner
We check if an ag contains the log in many places, so make this a first class XFS helper by lifting it to fs/xfs/libxfs/xfs_ag.h and renaming it xfs_ag_contains_log(). The convert all the places that check if the AG contains the log to use this helper. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07xfs: Pre-calculate per-AG agino geometryDave Chinner
There is a lot of overhead in functions like xfs_verify_agino() that repeatedly calculate the geometry limits of an AG. These can be pre-calculated as they are static and the verification context has a per-ag context it can quickly reference. In the case of xfs_verify_agino(), we now always have a perag context handy, so we can store the minimum and maximum agino values in the AG in the perag. This means we don't have to calculate it on every call and it can be inlined in callers if we move it to xfs_ag.h. xfs_verify_agino_or_null() gets the same perag treatment. xfs_agino_range() is moved to xfs_ag.c as it's not really a type function, and it's use is largely restricted as the first and last aginos can be grabbed straight from the perag in most cases. Note that we leave the original xfs_verify_agino in place in xfs_types.c as a static function as other callers in that file do not have per-ag contexts so still need to go the long way. It's been renamed to xfs_verify_agno_agino() to indicate it takes both an agno and an agino to differentiate it from new function. $ size --totals fs/xfs/built-in.a text data bss dec hex filename before 1482185 329588 572 1812345 1ba779 (TOTALS) after 1481937 329588 572 1812097 1ba681 (TOTALS) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07xfs: Pre-calculate per-AG agbno geometryDave Chinner
There is a lot of overhead in functions like xfs_verify_agbno() that repeatedly calculate the geometry limits of an AG. These can be pre-calculated as they are static and the verification context has a per-ag context it can quickly reference. In the case of xfs_verify_agbno(), we now always have a perag context handy, so we can store the AG length and the minimum valid block in the AG in the perag. This means we don't have to calculate it on every call and it can be inlined in callers if we move it to xfs_ag.h. Move xfs_ag_block_count() to xfs_ag.c because it's really a per-ag function and not an XFS type function. We need a little bit of rework that is specific to xfs_initialise_perag() to allow growfs to calculate the new perag sizes before we've updated the primary superblock during the grow (chicken/egg situation). Note that we leave the original xfs_verify_agbno in place in xfs_types.c as a static function as other callers in that file do not have per-ag contexts so still need to go the long way. It's been renamed to xfs_verify_agno_agbno() to indicate it takes both an agno and an agbno to differentiate it from new function. Future commits will make similar changes for other per-ag geometry validation functions. Further: $ size --totals fs/xfs/built-in.a text data bss dec hex filename before 1483006 329588 572 1813166 1baaae (TOTALS) after 1482185 329588 572 1812345 1ba779 (TOTALS) This rework reduces the binary size by ~820 bytes, indicating that much less work is being done to bounds check the agbno values against on per-ag geometry information. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07xfs: pass perag to xfs_alloc_read_agf()Dave Chinner
xfs_alloc_read_agf() initialises the perag if it hasn't been done yet, so it makes sense to pass it the perag rather than pull a reference from the buffer. This allows callers to be per-ag centric rather than passing mount/agno pairs everywhere. Whilst modifying the xfs_reflink_find_shared() function definition, declare it static and remove the extern declaration as it is an internal function only these days. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07xfs: kill xfs_alloc_pagf_init()Dave Chinner
Trivial wrapper around xfs_alloc_read_agf(), can be easily replaced by passing a NULL agfbp to xfs_alloc_read_agf(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07xfs: pass perag to xfs_ialloc_read_agi()Dave Chinner
xfs_ialloc_read_agi() initialises the perag if it hasn't been done yet, so it makes sense to pass it the perag rather than pull a reference from the buffer. This allows callers to be per-ag centric rather than passing mount/agno pairs everywhere. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07xfs: kill xfs_ialloc_pagi_init()Dave Chinner
This is just a basic wrapper around xfs_ialloc_read_agi(), which can be entirely handled by xfs_ialloc_read_agi() by passing a NULL agibpp.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07xfs: make last AG grow/shrink perag centricDave Chinner
Because the perag must exist for these operations, look it up as part of the common shrink operations and pass it instead of the mount/agno pair. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-05-27xfs: don't assert fail on perag references on teardownDave Chinner
Not fatal, the assert is there to catch developer attention. I'm seeing this occasionally during recoveryloop testing after a shutdown, and I don't want this to stop an overnight recoveryloop run as it is currently doing. Convert the ASSERT to a XFS_IS_CORRUPT() check so it will dump a corruption report into the log and cause a test failure that way, but it won't stop the machine dead. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2021-11-10xfs: #ifdef out perag code for userspaceEric Sandeen
The xfs_perag structure and initialization is unused in userspace, so #ifdef it out with __KERNEL__ to facilitate the xfsprogs sync and build. Signed-off-by: Eric Sandeen <esandeen@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-22xfs: rename xfs_bmap_add_free to xfs_free_extent_laterDarrick J. Wong
xfs_bmap_add_free isn't a block mapping function; it schedules deferred freeing operations for a later point in a compound transaction chain. While it's primarily used by bunmapi, its use has expanded beyond that. Move it to xfs_alloc.c and rename the function since it's now general freeing functionality. Bring the slab cache bits in line with the way we handle the other intent items. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-08-19xfs: convert bp->b_bn references to xfs_buf_daddr()Dave Chinner
Stop directly referencing b_bn in code outside the buffer cache, as b_bn is supposed to be used only as an internal cache index. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: convert xfs_sb_version_has checks to use mount featuresDave Chinner
This is a conversion of the remaining xfs_sb_version_has..(sbp) checks to use xfs_has_..(mp) feature checks. This was largely done with a vim replacement macro that did: :0,$s/xfs_sb_version_has\(.*\)&\(.*\)->m_sb/xfs_has_\1\2/g<CR> A couple of other variants were also used, and the rest touched up by hand. $ size -t fs/xfs/built-in.a text data bss dec hex filename before 1127533 311352 484 1439369 15f689 (TOTALS) after 1125360 311352 484 1437196 15ee0c (TOTALS) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: replace xfs_sb_version checks with feature flag checksDave Chinner
Convert the xfs_sb_version_hasfoo() to checks against mp->m_features. Checks of the superblock itself during disk operations (e.g. in the read/write verifiers and the to/from disk formatters) are not converted - they operate purely on the superblock state. Everything else should use the mount features. Large parts of this conversion were done with sed with commands like this: for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f done With manual cleanups for things like "xfs_has_extflgbit" and other little inconsistencies in naming. The result is ia lot less typing to check features and an XFS binary size reduced by a bit over 3kB: $ size -t fs/xfs/built-in.a text data bss dec hex filenam before 1130866 311352 484 1442702 16038e (TOTALS) after 1127727 311352 484 1439563 15f74b (TOTALS) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-15xfs: check for sparse inode clusters that cross new EOAG when shrinkingDarrick J. Wong
While running xfs/168, I noticed occasional write verifier shutdowns involving inodes at the very end of the filesystem. Existing inode btree validation code checks that all inode clusters are fully contained within the filesystem. However, due to inadequate checking in the fs shrink code, it's possible that there could be a sparse inode cluster at the end of the filesystem where the upper inodes of the cluster are marked as holes and the corresponding blocks are free. In this case, the last blocks in the AG are listed in the bnobt. This enables the shrink to proceed but results in a filesystem that trips the inode verifiers. Fix this by disallowing the shrink. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-06-21xfs: fix endianness issue in xfs_ag_shrink_spaceDarrick J. Wong
The AGI buffer is in big-endian format, so we must convert the endianness to CPU format to do any comparisons. Fixes: 46141dc891f7 ("xfs: introduce xfs_ag_shrink_space()") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-06-08Merge tag 'inode-walk-cleanups-5.14_2021-06-03' of ↵Darrick J. Wong
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.14-merge2 xfs: clean up incore inode walk functions This ambitious series aims to cleans up redundant inode walk code in xfs_icache.c, hide implementation details of the quotaoff dquot release code, and eliminates indirect function calls from incore inode walks. The first thing it does is to move all the code that quotaoff calls to release dquots from all incore inodes into xfs_icache.c. Next, it separates the goal of an inode walk from the actual radix tree tags that may or may not be involved and drops the kludgy XFS_ICI_NO_TAG thing. Finally, we split the speculative preallocation (blockgc) and quotaoff dquot release code paths into separate functions so that we can keep the implementations cohesive. Christoph suggested last cycle that we 'simply' change quotaoff not to allow deactivating quota entirely, but as these cleanups are to enable one major change in behavior (deferred inode inactivation) I do not want to add a second behavior change (quotaoff) as a dependency. To be blunt: Additional cleanups are not in scope for this series. Next, I made two observations about incore inode radix tree walks -- since there's a 1:1 mapping between the walk goal and the per-inode processing function passed in, we can use the goal to make a direct call to the processing function. Furthermore, the only caller to supply a nonzero iter_flags argument is quotaoff, and there's only one INEW flag. From that observation, I concluded that it's quite possible to remove two parameters from the xfs_inode_walk* function signatures -- the iter_flags, and the execute function pointer. The middle of the series moves the INEW functionality into the one piece (quotaoff) that wants it, and removes the indirect calls. The final observation is that the inode reclaim walk loop is now almost the same as xfs_inode_walk, so it's silly to maintain two copies. Merge the reclaim loop code into xfs_inode_walk. Lastly, refactor the per-ag radix tagging functions since there's duplicated code that can be consolidated. This series is a prerequisite for the next two patchsets, since deferred inode inactivation will add another inode radix tree tag and iterator function to xfs_inode_walk. v2: walk the vfs inode list when running quotaoff instead of the radix tree, then rework the (now completely internal) inode walk function to take the tag as the main parameter. v3: merge the reclaim loop into xfs_inode_walk, then consolidate the radix tree tagging functions v4: rebase to 5.13-rc4 v5: combine with the quotaoff patchset, reorder functions to minimize forward declarations, split inode walk goals from radix tree tags to reduce conceptual confusion v6: start moving the inode cache code towards the xfs_icwalk prefix * tag 'inode-walk-cleanups-5.14_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: refactor per-AG inode tagging functions xfs: merge xfs_reclaim_inodes_ag into xfs_inode_walk_ag xfs: pass struct xfs_eofblocks to the inode scan callback xfs: fix radix tree tag signs xfs: make the icwalk processing functions clean up the grab state xfs: clean up inode state flag tests in xfs_blockgc_igrab xfs: remove indirect calls from xfs_inode_walk{,_ag} xfs: remove iter_flags parameter from xfs_inode_walk_* xfs: move xfs_inew_wait call into xfs_dqrele_inode xfs: separate the dqrele_all inode grab logic from xfs_inode_walk_ag_grab xfs: pass the goal of the incore inode walk to xfs_inode_walk() xfs: rename xfs_inode_walk functions to xfs_icwalk xfs: move the inode walk functions further down xfs: detach inode dquots at the end of inactivation xfs: move the quotaoff dqrele inode walk into xfs_icache.c [djwong: added variable names to function declarations while fixing merge conflicts] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-08Merge tag 'xfs-perag-conv-tag' of ↵Darrick J. Wong
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.14-merge2 xfs: initial agnumber -> perag conversions for shrink If we want to use active references to the perag to be able to gate shrink removing AGs and hence perags safely, we've got a fair bit of work to do actually use perags in all the places we need to. There's a lot of code that iterates ag numbers and then looks up perags from that, often multiple times for the same perag in the one operation. If we want to use reference counted perags for access control, then we need to convert all these uses to perag iterators, not agno iterators. [Patches 1-4] The first step of this is consolidating all the perag management - init, free, get, put, etc into a common location. THis is spread all over the place right now, so move it all into libxfs/xfs_ag.[ch]. This does expose kernel only bits of the perag to libxfs and hence userspace, so the structures and code is rearranged to minimise the number of ifdefs that need to be added to the userspace codebase. The perag iterator in xfs_icache.c is promoted to a first class API and expanded to the needs of the code as required. [Patches 5-10] These are the first basic perag iterator conversions and changes to pass the perag down the stack from those iterators where appropriate. A lot of this is obvious, simple changes, though in some places we stop passing the perag down the stack because the code enters into an as yet unconverted subsystem that still uses raw AGs. [Patches 11-16] These replace the agno passed in the btree cursor for per-ag btree operations with a perag that is passed to the cursor init function. The cursor takes it's own reference to the perag, and the reference is dropped when the cursor is deleted. Hence we get reference coverage for the entire time the cursor is active, even if the code that initialised the cursor drops it's reference before the cursor or any of it's children (duplicates) have been deleted. The first patch adds the perag infrastructure for the cursor, the next four patches convert a btree cursor at a time, and the last removes the agno from the cursor once it is unused. [Patches 17-21] These patches are a demonstration of the simplifications and cleanups that come from plumbing the perag through interfaces that select and then operate on a specific AG. In this case the inode allocation algorithm does up to three walks across all AGs before it either allocates an inode or fails. Two of these walks are purely just to select the AG, and even then it doesn't guarantee inode allocation success so there's a third walk if the selected AG allocation fails. These patches collapse the selection and allocation into a single loop, simplifies the error handling because xfs_dir_ialloc() always returns ENOSPC if no AG was selected for inode allocation or we fail to allocate an inode in any AG, gets rid of xfs_dir_ialloc() wrapper, converts inode allocation to run entirely from a single perag instance, and then factors xfs_dialloc() into a much, much simpler loop which is easy to understand. Hence we end up with the same inode allocation logic, but it only needs two complete iterations at worst, makes AG selection and allocation atomic w.r.t. shrink and chops out out over 100 lines of code from this hot code path. [Patch 22] Converts the unlink path to pass perags through it. There's more conversion work to be done, but this patchset gets through a large chunk of it in one hit. Most of the iterators are converted, so once this is solidified we can move on to converting these to active references for being able to free perags while the fs is still active. * tag 'xfs-perag-conv-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (23 commits) xfs: remove xfs_perag_t xfs: use perag through unlink processing xfs: clean up and simplify xfs_dialloc() xfs: inode allocation can use a single perag instance xfs: get rid of xfs_dir_ialloc() xfs: collapse AG selection for inode allocation xfs: simplify xfs_dialloc_select_ag() return values xfs: remove agno from btree cursor xfs: use perag for ialloc btree cursors xfs: convert allocbt cursors to use perags xfs: convert refcount btree cursor to use perags xfs: convert rmap btree cursor to using a perag xfs: add a perag to the btree cursor xfs: pass perags around in fsmap data dev functions xfs: push perags through the ag reservation callouts xfs: pass perags through to the busy extent code xfs: convert secondary superblock walk to use perags xfs: convert xfs_iwalk to use perag references xfs: convert raw ag walks to use for_each_perag xfs: make for_each_perag... a first class citizen ...
2021-06-02xfs: remove xfs_perag_tDave Chinner
Almost unused, gets rid of another typedef. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02xfs: convert rmap btree cursor to using a peragDave Chinner
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02xfs: move perag structure and setup to libxfs/xfs_ag.[ch]Dave Chinner
Move the xfs_perag infrastructure to the libxfs files that contain all the per AG infrastructure. This helps set up for passing perags around all the code instead of bare agnos with minimal extra includes for existing files. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02xfs: move xfs_perag_get/put to xfs_ag.[ch]Dave Chinner
They are AG functions, not superblock functions, so move them to the appropriate location. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01xfs: use xfs_buf_alloc_pages for uncached buffersDave Chinner
Use the newly factored out page allocation code. This adds automatic buffer zeroing for non-read uncached buffers. This also allows us to greatly simply the error handling in xfs_buf_get_uncached(). Because xfs_buf_alloc_pages() cleans up partial allocation failure, we can just call xfs_buf_free() in all error cases now to clean up after failures. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>