From 96355d5a1f0ee6dcc182c37db4894ec0c29f1692 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Mon, 29 Jun 2020 14:48:45 -0700 Subject: xfs: Don't allow logging of XFS_ISTALE inodes In tracking down a problem in this patchset, I discovered we are reclaiming dirty stale inodes. This wasn't discovered until inodes were always attached to the cluster buffer and then the rcu callback that freed inodes was assert failing because the inode still had an active pointer to the cluster buffer after it had been reclaimed. Debugging the issue indicated that this was a pre-existing issue resulting from the way the inodes are handled in xfs_inactive_ifree. When we free a cluster buffer from xfs_ifree_cluster, all the inodes in cache are marked XFS_ISTALE. Those that are clean have nothing else done to them and so eventually get cleaned up by background reclaim. i.e. it is assumed we'll never dirty/relog an inode marked XFS_ISTALE. On journal commit dirty stale inodes as are handled by both buffer and inode log items to run though xfs_istale_done() and removed from the AIL (buffer log item commit) or the log item will simply unpin it because the buffer log item will clean it. What happens to any specific inode is entirely dependent on which log item wins the commit race, but the result is the same - stale inodes are clean, not attached to the cluster buffer, and not in the AIL. Hence inode reclaim can just free these inodes without further care. However, if the stale inode is relogged, it gets dirtied again and relogged into the CIL. Most of the time this isn't an issue, because relogging simply changes the inode's location in the current checkpoint. Problems arise, however, when the CIL checkpoints between two transactions in the xfs_inactive_ifree() deferops processing. This results in the XFS_ISTALE inode being redirtied and inserted into the CIL without any of the other stale cluster buffer infrastructure being in place. Hence on journal commit, it simply gets unpinned, so it remains dirty in memory. Everything in inode writeback avoids XFS_ISTALE inodes so it can't be written back, and it is not tracked in the AIL so there's not even a trigger to attempt to clean the inode. Hence the inode just sits dirty in memory until inode reclaim comes along, sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming of a dirty inode caused use after free, list corruptions and other nasty issues later in this patchset. Hence this patch addresses a violation of the "never log XFS_ISTALE inodes" caused by the deferops processing rolling a transaction and relogging a stale inode in xfs_inactive_free. It also adds a bunch of asserts to catch this problem in debug kernels so that we don't reintroduce this problem in future. Reproducer for this issue was generic/558 on a v4 filesystem. Signed-off-by: Dave Chinner Reviewed-by: Brian Foster Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong --- fs/xfs/xfs_icache.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'fs/xfs/xfs_icache.c') diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 5daef654956c..59dea8178ae3 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1141,7 +1141,7 @@ restart: goto out_ifunlock; xfs_iunpin_wait(ip); } - if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) { + if (xfs_inode_clean(ip)) { xfs_ifunlock(ip); goto reclaim; } @@ -1228,6 +1228,7 @@ reclaim: xfs_ilock(ip, XFS_ILOCK_EXCL); xfs_qm_dqdetach(ip); xfs_iunlock(ip, XFS_ILOCK_EXCL); + ASSERT(xfs_inode_clean(ip)); __xfs_inode_free(ip); return error; -- cgit From 993f951f501c85e963b3664739c07196a286eac7 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Mon, 29 Jun 2020 14:49:16 -0700 Subject: xfs: make inode reclaim almost non-blocking Now that dirty inode writeback doesn't cause read-modify-write cycles on the inode cluster buffer under memory pressure, the need to throttle memory reclaim to the rate at which we can clean dirty inodes goes away. That is due to the fact that we no longer thrash inode cluster buffers under memory pressure to clean dirty inodes. This means inode writeback no longer stalls on memory allocation or read IO, and hence can be done asynchronously without generating memory pressure. As a result, blocking inode writeback in reclaim is no longer necessary to prevent reclaim priority windup as cleaning dirty inodes is no longer dependent on having memory reserves available for the filesystem to make progress reclaiming inodes. Hence we can convert inode reclaim to be non-blocking for shrinker callouts, both for direct reclaim and kswapd. On a vanilla kernel, running a 16-way fsmark create workload on a 4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via userspace mlock(). The OOM killer gets invoked at 15GB of pinned RAM. Without the inode cluster pinning, this non-blocking reclaim patch triggers premature OOM killer invocation with the same memory pinning, sometimes with as much as 45% of RAM being free. It's trivially easy to trigger the OOM killer when reclaim does not block. With pinning inode clusters in RAM and then adding this patch, I can reliably pin 14.5GB of RAM and still have the fsmark workload run to completion. The OOM killer gets invoked 14.75GB of pinned RAM, which is only a small amount of memory less than the vanilla kernel. It is much more reliable than just with async reclaim alone. simoops shows that allocation stalls go away when async reclaim is used. Vanilla kernel: Run time: 1924 seconds Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792) Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936) Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640) work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70) alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02) With inode cluster pinning and async reclaim: Run time: 1924 seconds Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216) Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504) Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256) work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34) alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03) Latencies don't really change much, nor does the work rate. However, allocation almost never stalls with these changes, whilst the vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample period. This difference is due to inode reclaim being largely non-blocking now. IOWs, once we have pinned inode cluster buffers, we can make inode reclaim non-blocking without a major risk of premature and/or spurious OOM killer invocation, and without any changes to memory reclaim infrastructure. Signed-off-by: Dave Chinner Reviewed-by: Amir Goldstein Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster Signed-off-by: Darrick J. Wong --- fs/xfs/xfs_icache.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs/xfs/xfs_icache.c') diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 59dea8178ae3..01f8efc5f59c 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1402,7 +1402,7 @@ xfs_reclaim_inodes_nr( xfs_reclaim_work_queue(mp); xfs_ail_push_all(mp->m_ail); - return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan); + return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan); } /* -- cgit From 617825fe3489ac231790e5c843107168838b8547 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Mon, 29 Jun 2020 14:49:16 -0700 Subject: xfs: remove IO submission from xfs_reclaim_inode() We no longer need to issue IO from shrinker based inode reclaim to prevent spurious OOM killer invocation. This leaves only the global filesystem management operations such as unmount needing to writeback dirty inodes and reclaim them. Instead of using the reclaim pass to write dirty inodes before reclaiming them, use the AIL to push all the dirty inodes before we try to reclaim them. This allows us to remove all the conditional SYNC_WAIT locking and the writeback code from xfs_reclaim_inode() and greatly simplify the checks we need to do to reclaim an inode. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster Signed-off-by: Darrick J. Wong --- fs/xfs/xfs_icache.c | 117 ++++++++++++++-------------------------------------- 1 file changed, 31 insertions(+), 86 deletions(-) (limited to 'fs/xfs/xfs_icache.c') diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 01f8efc5f59c..fa4df9b4edb5 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1111,24 +1111,17 @@ xfs_reclaim_inode_grab( * dirty, async => requeue * dirty, sync => flush, wait and reclaim */ -STATIC int +static bool xfs_reclaim_inode( struct xfs_inode *ip, struct xfs_perag *pag, int sync_mode) { - struct xfs_buf *bp = NULL; xfs_ino_t ino = ip->i_ino; /* for radix_tree_delete */ - int error; -restart: - error = 0; xfs_ilock(ip, XFS_ILOCK_EXCL); - if (!xfs_iflock_nowait(ip)) { - if (!(sync_mode & SYNC_WAIT)) - goto out; - xfs_iflock(ip); - } + if (!xfs_iflock_nowait(ip)) + goto out; if (XFS_FORCED_SHUTDOWN(ip->i_mount)) { xfs_iunpin_wait(ip); @@ -1136,52 +1129,12 @@ restart: xfs_iflush_abort(ip); goto reclaim; } - if (xfs_ipincount(ip)) { - if (!(sync_mode & SYNC_WAIT)) - goto out_ifunlock; - xfs_iunpin_wait(ip); - } - if (xfs_inode_clean(ip)) { - xfs_ifunlock(ip); - goto reclaim; - } - - /* - * Never flush out dirty data during non-blocking reclaim, as it would - * just contend with AIL pushing trying to do the same job. - */ - if (!(sync_mode & SYNC_WAIT)) + if (xfs_ipincount(ip)) + goto out_ifunlock; + if (!xfs_inode_clean(ip)) goto out_ifunlock; - /* - * Now we have an inode that needs flushing. - * - * Note that xfs_iflush will never block on the inode buffer lock, as - * xfs_ifree_cluster() can lock the inode buffer before it locks the - * ip->i_lock, and we are doing the exact opposite here. As a result, - * doing a blocking xfs_imap_to_bp() to get the cluster buffer would - * result in an ABBA deadlock with xfs_ifree_cluster(). - * - * As xfs_ifree_cluser() must gather all inodes that are active in the - * cache to mark them stale, if we hit this case we don't actually want - * to do IO here - we want the inode marked stale so we can simply - * reclaim it. Hence if we get an EAGAIN error here, just unlock the - * inode, back off and try again. Hopefully the next pass through will - * see the stale flag set on the inode. - */ - error = xfs_iflush(ip, &bp); - if (error == -EAGAIN) { - xfs_iunlock(ip, XFS_ILOCK_EXCL); - /* backoff longer than in xfs_ifree_cluster */ - delay(2); - goto restart; - } - - if (!error) { - error = xfs_bwrite(bp); - xfs_buf_relse(bp); - } - + xfs_ifunlock(ip); reclaim: ASSERT(!xfs_isiflocked(ip)); @@ -1231,21 +1184,14 @@ reclaim: ASSERT(xfs_inode_clean(ip)); __xfs_inode_free(ip); - return error; + return true; out_ifunlock: xfs_ifunlock(ip); out: - xfs_iflags_clear(ip, XFS_IRECLAIM); xfs_iunlock(ip, XFS_ILOCK_EXCL); - /* - * We could return -EAGAIN here to make reclaim rescan the inode tree in - * a short while. However, this just burns CPU time scanning the tree - * waiting for IO to complete and the reclaim work never goes back to - * the idle state. Instead, return 0 to let the next scheduled - * background reclaim attempt to reclaim the inode again. - */ - return 0; + xfs_iflags_clear(ip, XFS_IRECLAIM); + return false; } /* @@ -1253,21 +1199,22 @@ out: * corrupted, we still want to try to reclaim all the inodes. If we don't, * then a shut down during filesystem unmount reclaim walk leak all the * unreclaimed inodes. + * + * Returns non-zero if any AGs or inodes were skipped in the reclaim pass + * so that callers that want to block until all dirty inodes are written back + * and reclaimed can sanely loop. */ -STATIC int +static int xfs_reclaim_inodes_ag( struct xfs_mount *mp, int flags, int *nr_to_scan) { struct xfs_perag *pag; - int error = 0; - int last_error = 0; xfs_agnumber_t ag; int trylock = flags & SYNC_TRYLOCK; int skipped; -restart: ag = 0; skipped = 0; while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) { @@ -1341,9 +1288,8 @@ restart: for (i = 0; i < nr_found; i++) { if (!batch[i]) continue; - error = xfs_reclaim_inode(batch[i], pag, flags); - if (error && last_error != -EFSCORRUPTED) - last_error = error; + if (!xfs_reclaim_inode(batch[i], pag, flags)) + skipped++; } *nr_to_scan -= XFS_LOOKUP_BATCH; @@ -1359,19 +1305,7 @@ restart: mutex_unlock(&pag->pag_ici_reclaim_lock); xfs_perag_put(pag); } - - /* - * if we skipped any AG, and we still have scan count remaining, do - * another pass this time using blocking reclaim semantics (i.e - * waiting on the reclaim locks and ignoring the reclaim cursors). This - * ensure that when we get more reclaimers than AGs we block rather - * than spin trying to execute reclaim. - */ - if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) { - trylock = 0; - goto restart; - } - return last_error; + return skipped; } int @@ -1380,8 +1314,18 @@ xfs_reclaim_inodes( int mode) { int nr_to_scan = INT_MAX; + int skipped; - return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan); + xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan); + if (!(mode & SYNC_WAIT)) + return 0; + + do { + xfs_ail_push_all_sync(mp->m_ail); + skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan); + } while (skipped > 0); + + return 0; } /* @@ -1402,7 +1346,8 @@ xfs_reclaim_inodes_nr( xfs_reclaim_work_queue(mp); xfs_ail_push_all(mp->m_ail); - return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan); + xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan); + return 0; } /* -- cgit From 0e8e2c6343dd74a4f55f8507a9fae9064d456436 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Mon, 29 Jun 2020 14:49:16 -0700 Subject: xfs: allow multiple reclaimers per AG Inode reclaim will still throttle direct reclaim on the per-ag reclaim locks. This is no longer necessary as reclaim can run non-blocking now. Hence we can remove these locks so that we don't arbitrarily block reclaimers just because there are more direct reclaimers than there are AGs. This can result in multiple reclaimers working on the same range of an AG, but this doesn't cause any apparent issues. Optimising the spread of concurrent reclaimers for best efficiency can be done in a future patchset. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster Signed-off-by: Darrick J. Wong --- fs/xfs/xfs_icache.c | 31 ++++++++++++------------------- 1 file changed, 12 insertions(+), 19 deletions(-) (limited to 'fs/xfs/xfs_icache.c') diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index fa4df9b4edb5..592eab23c6e7 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1211,12 +1211,9 @@ xfs_reclaim_inodes_ag( int *nr_to_scan) { struct xfs_perag *pag; - xfs_agnumber_t ag; - int trylock = flags & SYNC_TRYLOCK; - int skipped; + xfs_agnumber_t ag = 0; + int skipped = 0; - ag = 0; - skipped = 0; while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) { unsigned long first_index = 0; int done = 0; @@ -1224,15 +1221,13 @@ xfs_reclaim_inodes_ag( ag = pag->pag_agno + 1; - if (trylock) { - if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) { - skipped++; - xfs_perag_put(pag); - continue; - } - first_index = pag->pag_ici_reclaim_cursor; - } else - mutex_lock(&pag->pag_ici_reclaim_lock); + /* + * If the cursor is not zero, we haven't scanned the whole AG + * so we might have skipped inodes here. + */ + first_index = READ_ONCE(pag->pag_ici_reclaim_cursor); + if (first_index) + skipped++; do { struct xfs_inode *batch[XFS_LOOKUP_BATCH]; @@ -1298,11 +1293,9 @@ xfs_reclaim_inodes_ag( } while (nr_found && !done && *nr_to_scan > 0); - if (trylock && !done) - pag->pag_ici_reclaim_cursor = first_index; - else - pag->pag_ici_reclaim_cursor = 0; - mutex_unlock(&pag->pag_ici_reclaim_lock); + if (done) + first_index = 0; + WRITE_ONCE(pag->pag_ici_reclaim_cursor, first_index); xfs_perag_put(pag); } return skipped; -- cgit From 9552e14d3e879a3b4281427ef368271f371ea167 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Mon, 29 Jun 2020 14:49:17 -0700 Subject: xfs: don't block inode reclaim on the ILOCK When we attempt to reclaim an inode, the first thing we do is take the inode lock. This is blocking right now, so if the inode being accessed by something else (e.g. being flushed to the cluster buffer) we will block here. Change this to a trylock so that we do not block inode reclaim unnecessarily here. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster Signed-off-by: Darrick J. Wong --- fs/xfs/xfs_icache.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'fs/xfs/xfs_icache.c') diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 592eab23c6e7..f387ec21dd35 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1119,9 +1119,10 @@ xfs_reclaim_inode( { xfs_ino_t ino = ip->i_ino; /* for radix_tree_delete */ - xfs_ilock(ip, XFS_ILOCK_EXCL); - if (!xfs_iflock_nowait(ip)) + if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) goto out; + if (!xfs_iflock_nowait(ip)) + goto out_iunlock; if (XFS_FORCED_SHUTDOWN(ip->i_mount)) { xfs_iunpin_wait(ip); @@ -1188,8 +1189,9 @@ reclaim: out_ifunlock: xfs_ifunlock(ip); -out: +out_iunlock: xfs_iunlock(ip, XFS_ILOCK_EXCL); +out: xfs_iflags_clear(ip, XFS_IRECLAIM); return false; } -- cgit From 50718b8d73dda01bb168f9f3b16f6311a2debe7b Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Wed, 1 Jul 2020 10:21:05 -0700 Subject: xfs: remove SYNC_TRYLOCK from inode reclaim All background reclaim is SYNC_TRYLOCK already, and even blocking reclaim (SYNC_WAIT) can use trylock mechanisms as xfs_reclaim_inodes_ag() will keep cycling until there are no more reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode reclaim and make everything unconditionally non-blocking. We remove all the optimistic "avoid blocking on locks" checks done in xfs_reclaim_inode_grab() as nothing blocks on locks anymore. Further, checking XFS_IFLOCK optimistically can result in detecting inodes in the process of being cleaned (i.e. between being removed from the AIL and having the flush lock dropped), so for xfs_reclaim_inodes() to reliably reclaim all inodes we need to drop these checks anyway. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster Signed-off-by: Darrick J. Wong --- fs/xfs/xfs_icache.c | 63 +++++++++++++++++++++-------------------------------- 1 file changed, 25 insertions(+), 38 deletions(-) (limited to 'fs/xfs/xfs_icache.c') diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index f387ec21dd35..8d18117242e1 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -174,7 +174,7 @@ xfs_reclaim_worker( struct xfs_mount *mp = container_of(to_delayed_work(work), struct xfs_mount, m_reclaim_work); - xfs_reclaim_inodes(mp, SYNC_TRYLOCK); + xfs_reclaim_inodes(mp, 0); xfs_reclaim_work_queue(mp); } @@ -1028,48 +1028,37 @@ xfs_cowblocks_worker( /* * Grab the inode for reclaim exclusively. - * Return 0 if we grabbed it, non-zero otherwise. + * + * We have found this inode via a lookup under RCU, so the inode may have + * already been freed, or it may be in the process of being recycled by + * xfs_iget(). In both cases, the inode will have XFS_IRECLAIM set. If the inode + * has been fully recycled by the time we get the i_flags_lock, XFS_IRECLAIMABLE + * will not be set. Hence we need to check for both these flag conditions to + * avoid inodes that are no longer reclaim candidates. + * + * Note: checking for other state flags here, under the i_flags_lock or not, is + * racy and should be avoided. Those races should be resolved only after we have + * ensured that we are able to reclaim this inode and the world can see that we + * are going to reclaim it. + * + * Return true if we grabbed it, false otherwise. */ -STATIC int +static bool xfs_reclaim_inode_grab( - struct xfs_inode *ip, - int flags) + struct xfs_inode *ip) { ASSERT(rcu_read_lock_held()); - /* quick check for stale RCU freed inode */ - if (!ip->i_ino) - return 1; - - /* - * If we are asked for non-blocking operation, do unlocked checks to - * see if the inode already is being flushed or in reclaim to avoid - * lock traffic. - */ - if ((flags & SYNC_TRYLOCK) && - __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM)) - return 1; - - /* - * The radix tree lock here protects a thread in xfs_iget from racing - * with us starting reclaim on the inode. Once we have the - * XFS_IRECLAIM flag set it will not touch us. - * - * Due to RCU lookup, we may find inodes that have been freed and only - * have XFS_IRECLAIM set. Indeed, we may see reallocated inodes that - * aren't candidates for reclaim at all, so we must check the - * XFS_IRECLAIMABLE is set first before proceeding to reclaim. - */ spin_lock(&ip->i_flags_lock); if (!__xfs_iflags_test(ip, XFS_IRECLAIMABLE) || __xfs_iflags_test(ip, XFS_IRECLAIM)) { /* not a reclaim candidate. */ spin_unlock(&ip->i_flags_lock); - return 1; + return false; } __xfs_iflags_set(ip, XFS_IRECLAIM); spin_unlock(&ip->i_flags_lock); - return 0; + return true; } /* @@ -1114,8 +1103,7 @@ xfs_reclaim_inode_grab( static bool xfs_reclaim_inode( struct xfs_inode *ip, - struct xfs_perag *pag, - int sync_mode) + struct xfs_perag *pag) { xfs_ino_t ino = ip->i_ino; /* for radix_tree_delete */ @@ -1209,7 +1197,6 @@ out: static int xfs_reclaim_inodes_ag( struct xfs_mount *mp, - int flags, int *nr_to_scan) { struct xfs_perag *pag; @@ -1254,7 +1241,7 @@ xfs_reclaim_inodes_ag( for (i = 0; i < nr_found; i++) { struct xfs_inode *ip = batch[i]; - if (done || xfs_reclaim_inode_grab(ip, flags)) + if (done || !xfs_reclaim_inode_grab(ip)) batch[i] = NULL; /* @@ -1285,7 +1272,7 @@ xfs_reclaim_inodes_ag( for (i = 0; i < nr_found; i++) { if (!batch[i]) continue; - if (!xfs_reclaim_inode(batch[i], pag, flags)) + if (!xfs_reclaim_inode(batch[i], pag)) skipped++; } @@ -1311,13 +1298,13 @@ xfs_reclaim_inodes( int nr_to_scan = INT_MAX; int skipped; - xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan); + xfs_reclaim_inodes_ag(mp, &nr_to_scan); if (!(mode & SYNC_WAIT)) return 0; do { xfs_ail_push_all_sync(mp->m_ail); - skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan); + skipped = xfs_reclaim_inodes_ag(mp, &nr_to_scan); } while (skipped > 0); return 0; @@ -1341,7 +1328,7 @@ xfs_reclaim_inodes_nr( xfs_reclaim_work_queue(mp); xfs_ail_push_all(mp->m_ail); - xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan); + xfs_reclaim_inodes_ag(mp, &nr_to_scan); return 0; } -- cgit From 4d0bab3a44686f26be7ee7295c6c1987605ae35e Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Wed, 1 Jul 2020 10:21:28 -0700 Subject: xfs: remove SYNC_WAIT from xfs_reclaim_inodes() Clean up xfs_reclaim_inodes() callers. Most callers want blocking behaviour, so just make the existing SYNC_WAIT behaviour the default. For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag() directly because we just want optimistic clean inode reclaim to be done in the background. For xfs_quiesce_attr() we can just remove the inode reclaim calls as they are a historic relic that was required to flush dirty inodes that contained unlogged changes. We now log all changes to the inodes, so the sync AIL push from xfs_log_quiesce() called by xfs_quiesce_attr() will do all the required inode writeback for freeze. Seeing as we now want to loop until all reclaimable inodes have been reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG tag rather than having xfs_reclaim_inodes_ag() tell it that inodes were skipped. This is much more reliable and will always loop until all reclaimable inodes are reclaimed. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster Signed-off-by: Darrick J. Wong --- fs/xfs/xfs_icache.c | 79 ++++++++++++++++++++--------------------------------- 1 file changed, 29 insertions(+), 50 deletions(-) (limited to 'fs/xfs/xfs_icache.c') diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 8d18117242e1..f4e7b98d9639 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -160,24 +160,6 @@ xfs_reclaim_work_queue( rcu_read_unlock(); } -/* - * This is a fast pass over the inode cache to try to get reclaim moving on as - * many inodes as possible in a short period of time. It kicks itself every few - * seconds, as well as being kicked by the inode cache shrinker when memory - * goes low. It scans as quickly as possible avoiding locked inodes or those - * already being flushed, and once done schedules a future pass. - */ -void -xfs_reclaim_worker( - struct work_struct *work) -{ - struct xfs_mount *mp = container_of(to_delayed_work(work), - struct xfs_mount, m_reclaim_work); - - xfs_reclaim_inodes(mp, 0); - xfs_reclaim_work_queue(mp); -} - static void xfs_perag_set_reclaim_tag( struct xfs_perag *pag) @@ -1100,7 +1082,7 @@ xfs_reclaim_inode_grab( * dirty, async => requeue * dirty, sync => flush, wait and reclaim */ -static bool +static void xfs_reclaim_inode( struct xfs_inode *ip, struct xfs_perag *pag) @@ -1173,7 +1155,7 @@ reclaim: ASSERT(xfs_inode_clean(ip)); __xfs_inode_free(ip); - return true; + return; out_ifunlock: xfs_ifunlock(ip); @@ -1181,7 +1163,6 @@ out_iunlock: xfs_iunlock(ip, XFS_ILOCK_EXCL); out: xfs_iflags_clear(ip, XFS_IRECLAIM); - return false; } /* @@ -1194,14 +1175,13 @@ out: * so that callers that want to block until all dirty inodes are written back * and reclaimed can sanely loop. */ -static int +static void xfs_reclaim_inodes_ag( struct xfs_mount *mp, int *nr_to_scan) { struct xfs_perag *pag; xfs_agnumber_t ag = 0; - int skipped = 0; while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) { unsigned long first_index = 0; @@ -1210,14 +1190,7 @@ xfs_reclaim_inodes_ag( ag = pag->pag_agno + 1; - /* - * If the cursor is not zero, we haven't scanned the whole AG - * so we might have skipped inodes here. - */ first_index = READ_ONCE(pag->pag_ici_reclaim_cursor); - if (first_index) - skipped++; - do { struct xfs_inode *batch[XFS_LOOKUP_BATCH]; int i; @@ -1270,16 +1243,12 @@ xfs_reclaim_inodes_ag( rcu_read_unlock(); for (i = 0; i < nr_found; i++) { - if (!batch[i]) - continue; - if (!xfs_reclaim_inode(batch[i], pag)) - skipped++; + if (batch[i]) + xfs_reclaim_inode(batch[i], pag); } *nr_to_scan -= XFS_LOOKUP_BATCH; - cond_resched(); - } while (nr_found && !done && *nr_to_scan > 0); if (done) @@ -1287,27 +1256,18 @@ xfs_reclaim_inodes_ag( WRITE_ONCE(pag->pag_ici_reclaim_cursor, first_index); xfs_perag_put(pag); } - return skipped; } -int +void xfs_reclaim_inodes( - xfs_mount_t *mp, - int mode) + struct xfs_mount *mp) { int nr_to_scan = INT_MAX; - int skipped; - xfs_reclaim_inodes_ag(mp, &nr_to_scan); - if (!(mode & SYNC_WAIT)) - return 0; - - do { + while (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) { xfs_ail_push_all_sync(mp->m_ail); - skipped = xfs_reclaim_inodes_ag(mp, &nr_to_scan); - } while (skipped > 0); - - return 0; + xfs_reclaim_inodes_ag(mp, &nr_to_scan); + }; } /* @@ -1426,6 +1386,25 @@ xfs_inode_matches_eofb( return true; } +/* + * This is a fast pass over the inode cache to try to get reclaim moving on as + * many inodes as possible in a short period of time. It kicks itself every few + * seconds, as well as being kicked by the inode cache shrinker when memory + * goes low. It scans as quickly as possible avoiding locked inodes or those + * already being flushed, and once done schedules a future pass. + */ +void +xfs_reclaim_worker( + struct work_struct *work) +{ + struct xfs_mount *mp = container_of(to_delayed_work(work), + struct xfs_mount, m_reclaim_work); + int nr_to_scan = INT_MAX; + + xfs_reclaim_inodes_ag(mp, &nr_to_scan); + xfs_reclaim_work_queue(mp); +} + STATIC int xfs_inode_free_eofblocks( struct xfs_inode *ip, -- cgit From 02511a5a6a49f9730ad215caa77e4d980008c6c6 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Mon, 29 Jun 2020 14:49:18 -0700 Subject: xfs: clean up inode reclaim comments Inode reclaim is quite different now to the way described in various comments, so update all the comments explaining what it does and how it works. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster Signed-off-by: Darrick J. Wong --- fs/xfs/xfs_icache.c | 128 ++++++++++++++-------------------------------------- 1 file changed, 35 insertions(+), 93 deletions(-) (limited to 'fs/xfs/xfs_icache.c') diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index f4e7b98d9639..dc90a81abb1a 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -141,11 +141,8 @@ xfs_inode_free( } /* - * Queue a new inode reclaim pass if there are reclaimable inodes and there - * isn't a reclaim pass already in progress. By default it runs every 5s based - * on the xfs periodic sync default of 30s. Perhaps this should have it's own - * tunable, but that can be done if this method proves to be ineffective or too - * aggressive. + * Queue background inode reclaim work if there are reclaimable inodes and there + * isn't reclaim work already scheduled or in progress. */ static void xfs_reclaim_work_queue( @@ -600,48 +597,31 @@ out_destroy: } /* - * Look up an inode by number in the given file system. - * The inode is looked up in the cache held in each AG. - * If the inode is found in the cache, initialise the vfs inode - * if necessary. + * Look up an inode by number in the given file system. The inode is looked up + * in the cache held in each AG. If the inode is found in the cache, initialise + * the vfs inode if necessary. * - * If it is not in core, read it in from the file system's device, - * add it to the cache and initialise the vfs inode. + * If it is not in core, read it in from the file system's device, add it to the + * cache and initialise the vfs inode. * * The inode is locked according to the value of the lock_flags parameter. - * This flag parameter indicates how and if the inode's IO lock and inode lock - * should be taken. - * - * mp -- the mount point structure for the current file system. It points - * to the inode hash table. - * tp -- a pointer to the current transaction if there is one. This is - * simply passed through to the xfs_iread() call. - * ino -- the number of the inode desired. This is the unique identifier - * within the file system for the inode being requested. - * lock_flags -- flags indicating how to lock the inode. See the comment - * for xfs_ilock() for a list of valid values. + * Inode lookup is only done during metadata operations and not as part of the + * data IO path. Hence we only allow locking of the XFS_ILOCK during lookup. */ int xfs_iget( - xfs_mount_t *mp, - xfs_trans_t *tp, - xfs_ino_t ino, - uint flags, - uint lock_flags, - xfs_inode_t **ipp) + struct xfs_mount *mp, + struct xfs_trans *tp, + xfs_ino_t ino, + uint flags, + uint lock_flags, + struct xfs_inode **ipp) { - xfs_inode_t *ip; - int error; - xfs_perag_t *pag; - xfs_agino_t agino; + struct xfs_inode *ip; + struct xfs_perag *pag; + xfs_agino_t agino; + int error; - /* - * xfs_reclaim_inode() uses the ILOCK to ensure an inode - * doesn't get freed while it's being referenced during a - * radix tree traversal here. It assumes this function - * aqcuires only the ILOCK (and therefore it has no need to - * involve the IOLOCK in this synchronization). - */ ASSERT((lock_flags & (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED)) == 0); /* reject inode numbers outside existing AGs */ @@ -758,15 +738,7 @@ xfs_inode_walk_ag_grab( ASSERT(rcu_read_lock_held()); - /* - * check for stale RCU freed inode - * - * If the inode has been reallocated, it doesn't matter if it's not in - * the AG we are walking - we are walking for writeback, so if it - * passes all the "valid inode" checks and is dirty, then we'll write - * it back anyway. If it has been reallocated and still being - * initialised, the XFS_INEW check below will catch it. - */ + /* Check for stale RCU freed inode */ spin_lock(&ip->i_flags_lock); if (!ip->i_ino) goto out_unlock_noent; @@ -1044,43 +1016,16 @@ xfs_reclaim_inode_grab( } /* - * Inodes in different states need to be treated differently. The following - * table lists the inode states and the reclaim actions necessary: - * - * inode state iflush ret required action - * --------------- ---------- --------------- - * bad - reclaim - * shutdown EIO unpin and reclaim - * clean, unpinned 0 reclaim - * stale, unpinned 0 reclaim - * clean, pinned(*) 0 requeue - * stale, pinned EAGAIN requeue - * dirty, async - requeue - * dirty, sync 0 reclaim + * Inode reclaim is non-blocking, so the default action if progress cannot be + * made is to "requeue" the inode for reclaim by unlocking it and clearing the + * XFS_IRECLAIM flag. If we are in a shutdown state, we don't care about + * blocking anymore and hence we can wait for the inode to be able to reclaim + * it. * - * (*) dgc: I don't think the clean, pinned state is possible but it gets - * handled anyway given the order of checks implemented. - * - * Also, because we get the flush lock first, we know that any inode that has - * been flushed delwri has had the flush completed by the time we check that - * the inode is clean. - * - * Note that because the inode is flushed delayed write by AIL pushing, the - * flush lock may already be held here and waiting on it can result in very - * long latencies. Hence for sync reclaims, where we wait on the flush lock, - * the caller should push the AIL first before trying to reclaim inodes to - * minimise the amount of time spent waiting. For background relaim, we only - * bother to reclaim clean inodes anyway. - * - * Hence the order of actions after gaining the locks should be: - * bad => reclaim - * shutdown => unpin and reclaim - * pinned, async => requeue - * pinned, sync => unpin - * stale => reclaim - * clean => reclaim - * dirty, async => requeue - * dirty, sync => flush, wait and reclaim + * We do no IO here - if callers require inodes to be cleaned they must push the + * AIL first to trigger writeback of dirty inodes. This enables writeback to be + * done in the background in a non-blocking manner, and enables memory reclaim + * to make progress without blocking. */ static void xfs_reclaim_inode( @@ -1271,13 +1216,11 @@ xfs_reclaim_inodes( } /* - * Scan a certain number of inodes for reclaim. - * - * When called we make sure that there is a background (fast) inode reclaim in - * progress, while we will throttle the speed of reclaim via doing synchronous - * reclaim of inodes. That means if we come across dirty inodes, we wait for - * them to be cleaned, which we hope will not be very long due to the - * background walker having already kicked the IO off on those dirty inodes. + * The shrinker infrastructure determines how many inodes we should scan for + * reclaim. We want as many clean inodes ready to reclaim as possible, so we + * push the AIL here. We also want to proactively free up memory if we can to + * minimise the amount of work memory reclaim has to do so we kick the + * background reclaim if it isn't already scheduled. */ long xfs_reclaim_inodes_nr( @@ -1390,8 +1333,7 @@ xfs_inode_matches_eofb( * This is a fast pass over the inode cache to try to get reclaim moving on as * many inodes as possible in a short period of time. It kicks itself every few * seconds, as well as being kicked by the inode cache shrinker when memory - * goes low. It scans as quickly as possible avoiding locked inodes or those - * already being flushed, and once done schedules a future pass. + * goes low. */ void xfs_reclaim_worker( -- cgit From 48d55e2ae3ce837598c073995bbbac5d24a35fe1 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Mon, 29 Jun 2020 14:49:18 -0700 Subject: xfs: attach inodes to the cluster buffer when dirtied Rather than attach inodes to the cluster buffer just when we are doing IO, attach the inodes to the cluster buffer when they are dirtied. The means the buffer always carries a list of dirty inodes that reference it, and we can use that list to make more fundamental changes to inode writeback that aren't otherwise possible. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster Signed-off-by: Darrick J. Wong --- fs/xfs/xfs_icache.c | 1 + 1 file changed, 1 insertion(+) (limited to 'fs/xfs/xfs_icache.c') diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index dc90a81abb1a..58a750ce689c 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -115,6 +115,7 @@ __xfs_inode_free( { /* asserts to verify all state is correct here */ ASSERT(atomic_read(&ip->i_pincount) == 0); + ASSERT(!ip->i_itemp || list_empty(&ip->i_itemp->ili_item.li_bio_list)); XFS_STATS_DEC(ip->i_mount, vn_active); call_rcu(&VFS_I(ip)->i_rcu, xfs_inode_free_callback); -- cgit From 8cd4901da56caadc16b4e8d6b434291a8ce31d7c Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Wed, 15 Jul 2020 17:42:36 -0700 Subject: xfs: rename XFS_DQ_{USER,GROUP,PROJ} to XFS_DQTYPE_* We're going to split up the incore dquot state flags from the ondisk dquot flags (eventually renaming this "type") so start by renaming the three flags and the bitmask that are going to participate in this. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_icache.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'fs/xfs/xfs_icache.c') diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 58a750ce689c..3c6e936d2f99 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1424,7 +1424,7 @@ __xfs_inode_free_quota_eofblocks( eofb.eof_flags = XFS_EOF_FLAGS_UNION|XFS_EOF_FLAGS_SYNC; if (XFS_IS_UQUOTA_ENFORCED(ip->i_mount)) { - dq = xfs_inode_dquot(ip, XFS_DQ_USER); + dq = xfs_inode_dquot(ip, XFS_DQTYPE_USER); if (dq && xfs_dquot_lowsp(dq)) { eofb.eof_uid = VFS_I(ip)->i_uid; eofb.eof_flags |= XFS_EOF_FLAGS_UID; @@ -1433,7 +1433,7 @@ __xfs_inode_free_quota_eofblocks( } if (XFS_IS_GQUOTA_ENFORCED(ip->i_mount)) { - dq = xfs_inode_dquot(ip, XFS_DQ_GROUP); + dq = xfs_inode_dquot(ip, XFS_DQTYPE_GROUP); if (dq && xfs_dquot_lowsp(dq)) { eofb.eof_gid = VFS_I(ip)->i_gid; eofb.eof_flags |= XFS_EOF_FLAGS_GID; -- cgit From 3050bd0bfe706381c36e4b48bf4de465b0ab94f7 Mon Sep 17 00:00:00 2001 From: Carlos Maiolino Date: Wed, 22 Jul 2020 09:23:04 -0700 Subject: xfs: Remove kmem_zone_alloc() usage Use kmem_cache_alloc() directly. All kmem_zone_alloc() users pass 0 as flags, which are translated into: GFP_KERNEL | __GFP_NOWARN, and kmem_zone_alloc() loops forever until the allocation succeeds. We can use __GFP_NOFAIL to tell the allocator to loop forever rather than doing it ourself, and because the allocation will never fail, we do not need to use __GFP_NOWARN anymore. Hence, all callers can be converted to use GFP_KERNEL | __GFP_NOFAIL Signed-off-by: Carlos Maiolino Reviewed-by: Darrick J. Wong [darrick: add a comment back in about nofail] Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig Reviewed-by: Dave Chinner --- fs/xfs/xfs_icache.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) (limited to 'fs/xfs/xfs_icache.c') diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 3c6e936d2f99..101028ebb571 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -37,13 +37,11 @@ xfs_inode_alloc( struct xfs_inode *ip; /* - * if this didn't occur in transactions, we could use - * KM_MAYFAIL and return NULL here on ENOMEM. Set the - * code up to do this anyway. + * XXX: If this didn't occur in transactions, we could drop GFP_NOFAIL + * and return NULL here on ENOMEM. */ - ip = kmem_zone_alloc(xfs_inode_zone, 0); - if (!ip) - return NULL; + ip = kmem_cache_alloc(xfs_inode_zone, GFP_KERNEL | __GFP_NOFAIL); + if (inode_init_always(mp->m_super, VFS_I(ip))) { kmem_cache_free(xfs_inode_zone, ip); return NULL; -- cgit