From 54752de928c40028e5575ab6b49be9ec3c24b466 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Thu, 23 Jul 2020 22:45:58 -0700 Subject: iomap: Only invalidate page cache pages on direct IO writes The historic requirement for XFS to invalidate cached pages on direct IO reads has been lost in the twisty pages of history - it was inherited from Irix, which implemented page cache invalidation on read as a method of working around problems synchronising page cache state with uncached IO. XFS has carried this ever since. In the initial linux ports it was necessary to get mmap and DIO to play "ok" together and not immediately corrupt data. This was the state of play until the linux kernel had infrastructure to track unwritten extents and synchronise page faults with allocations and unwritten extent conversions (->page_mkwrite infrastructure). IOws, the page cache invalidation on DIO read was necessary to prevent trivial data corruptions. This didn't solve all the problems, though. There were peformance problems if we didn't invalidate the entire page cache over the file on read - we couldn't easily determine if the cached pages were over the range of the IO, and invalidation required taking a serialising lock (i_mutex) on the inode. This serialising lock was an issue for XFS, as it was the only exclusive lock in the direct Io read path. Hence if there were any cached pages, we'd just invalidate the entire file in one go so that subsequent IOs didn't need to take the serialising lock. This was a problem that prevented ranged invalidation from being particularly useful for avoiding the remaining coherency issues. This was solved with the conversion of i_mutex to i_rwsem and the conversion of the XFS inode IO lock to use i_rwsem. Hence we could now just do ranged invalidation and the performance problem went away. However, page cache invalidation was still needed to serialise sub-page/sub-block zeroing via direct IO against buffered IO because bufferhead state attached to the cached page could get out of whack when direct IOs were issued. We've removed bufferheads from the XFS code, and we don't carry any extent state on the cached pages anymore, and so this problem has gone away, too. IOWs, it would appear that we don't have any good reason to be invalidating the page cache on DIO reads anymore. Hence remove the invalidation on read because it is unnecessary overhead, not needed to maintain coherency between mmap/buffered access and direct IO anymore, and prevents anyone from using direct IO reads from intentionally invalidating the page cache of a file. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Matthew Wilcox (Oracle) Signed-off-by: Christoph Hellwig Signed-off-by: Darrick J. Wong --- fs/iomap/direct-io.c | 31 +++++++++++++++---------------- 1 file changed, 15 insertions(+), 16 deletions(-) (limited to 'fs/iomap/direct-io.c') diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c index ec7b78e6feca..190967e87b69 100644 --- a/fs/iomap/direct-io.c +++ b/fs/iomap/direct-io.c @@ -475,23 +475,22 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, if (ret) goto out_free_dio; - /* - * Try to invalidate cache pages for the range we're direct - * writing. If this invalidation fails, tough, the write will - * still work, but racing two incompatible write paths is a - * pretty crazy thing to do, so we don't support it 100%. - */ - ret = invalidate_inode_pages2_range(mapping, - pos >> PAGE_SHIFT, end >> PAGE_SHIFT); - if (ret) - dio_warn_stale_pagecache(iocb->ki_filp); - ret = 0; + if (iov_iter_rw(iter) == WRITE) { + /* + * Try to invalidate cache pages for the range we are writing. + * If this invalidation fails, tough, the write will still work, + * but racing two incompatible write paths is a pretty crazy + * thing to do, so we don't support it 100%. + */ + if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT, + end >> PAGE_SHIFT)) + dio_warn_stale_pagecache(iocb->ki_filp); - if (iov_iter_rw(iter) == WRITE && !wait_for_completion && - !inode->i_sb->s_dio_done_wq) { - ret = sb_init_dio_done_wq(inode->i_sb); - if (ret < 0) - goto out_free_dio; + if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) { + ret = sb_init_dio_done_wq(inode->i_sb); + if (ret < 0) + goto out_free_dio; + } } inode_dio_begin(inode); -- cgit From 60263d5889e6dc5987dc51b801be4955ff2e4aa7 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 23 Jul 2020 22:45:59 -0700 Subject: iomap: fall back to buffered writes for invalidation failures Failing to invalid the page cache means data in incoherent, which is a very bad state for the system. Always fall back to buffered I/O through the page cache if we can't invalidate mappings. Signed-off-by: Christoph Hellwig Acked-by: Dave Chinner Reviewed-by: Goldwyn Rodrigues Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Acked-by: Bob Peterson Acked-by: Damien Le Moal Reviewed-by: Theodore Ts'o # for ext4 Reviewed-by: Andreas Gruenbacher # for gfs2 Reviewed-by: Ritesh Harjani --- fs/iomap/direct-io.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) (limited to 'fs/iomap/direct-io.c') diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c index 190967e87b69..c1aafb2ab990 100644 --- a/fs/iomap/direct-io.c +++ b/fs/iomap/direct-io.c @@ -10,6 +10,7 @@ #include #include #include +#include "trace.h" #include "../internal.h" @@ -401,6 +402,9 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length, * can be mapped into multiple disjoint IOs and only a subset of the IOs issued * may be pure data writes. In that case, we still need to do a full data sync * completion. + * + * Returns -ENOTBLK In case of a page invalidation invalidation failure for + * writes. The callers needs to fall back to buffered I/O in this case. */ ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, @@ -478,13 +482,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, if (iov_iter_rw(iter) == WRITE) { /* * Try to invalidate cache pages for the range we are writing. - * If this invalidation fails, tough, the write will still work, - * but racing two incompatible write paths is a pretty crazy - * thing to do, so we don't support it 100%. + * If this invalidation fails, let the caller fall back to + * buffered I/O. */ if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT, - end >> PAGE_SHIFT)) - dio_warn_stale_pagecache(iocb->ki_filp); + end >> PAGE_SHIFT)) { + trace_iomap_dio_invalidate_fail(inode, pos, count); + ret = -ENOTBLK; + goto out_free_dio; + } if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) { ret = sb_init_dio_done_wq(inode->i_sb); -- cgit