xfs: Don't use unwritten extents for DAX

DAX has a page fault serialisation problem with block allocation. Because it allows concurrent page faults and does not have a page lock to serialise faults to the same page, it can get two concurrent faults to the page that race. When two read faults race, this isn't a huge problem as the data underlying the page is not changing and so "detect and drop" works just fine. The issues are to do with write faults. When two write faults occur, we serialise block allocation in get_blocks() so only one faul will allocate the extent. It will, however, be marked as an unwritten extent, and that is where the problem lies - the DAX fault code cannot differentiate between a block that was just allocated and a block that was preallocated and needs zeroing. The result is that both write faults end up zeroing the block and attempting to convert it back to written. The problem is that the first fault can zero and convert before the second fault starts zeroing, resulting in the zeroing for the second fault overwriting the data that the first fault wrote with zeros. The second fault then attempts to convert the unwritten extent, which is then a no-op because it's already written. Data loss occurs as a result of this race. Because there is no sane locking construct in the page fault code that we can use for serialisation across the page faults, we need to ensure block allocation and zeroing occurs atomically in the filesystem. This means we can still take concurrent page faults and the only time they will serialise is in the filesystem mapping/allocation callback. The page fault code will always see written, initialised extents, so we will be able to remove the unwritten extent handling from the DAX code when all filesystems are converted. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
author: Dave Chinner <dchinner@redhat.com> 2015-11-03 12:37:00 +1100
committer: Dave Chinner <david@fromorbit.com> 2015-11-03 12:37:00 +1100
commit: 1ca191576fc862b4766f58e41aa362b28a7c1866 (patch)
tree: 07b9e420aae07600a11da6bac81477e15194cdfc /fs/xfs/xfs_aops.c
parent: 3fbbbea34bac049c0b5938dc065f7d8ee1ef7e67 (diff)
1 files changed, 9 insertions, 4 deletions
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index e747d6ad5d18..df3dabd469b9 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1284,15 +1284,12 @@ xfs_map_direct(
 
 	trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap);
 
-	/* XXX: preparation for removing unwritten extents in DAX */
-#if 0
 	if (dax_fault) {
 		ASSERT(type == XFS_IO_OVERWRITE);
 		trace_xfs_gbmap_direct_none(XFS_I(inode), offset, size, type,
 					    imap);
 		return;
 	}
-#endif
 
 	if (bh_result->b_private) {
 		ioend = bh_result->b_private;
@@ -1420,10 +1417,12 @@ __xfs_get_blocks(
 	if (error)
 		goto out_unlock;
 
+	/* for DAX, we convert unwritten extents directly */
 	if (create &&
 	    (!nimaps ||
 	     (imap.br_startblock == HOLESTARTBLOCK ||
-	      imap.br_startblock == DELAYSTARTBLOCK))) {
+	      imap.br_startblock == DELAYSTARTBLOCK) ||
+	     (IS_DAX(inode) && ISUNWRITTEN(&imap)))) {
 		if (direct || xfs_get_extsz_hint(ip)) {
 			/*
 			 * Drop the ilock in preparation for starting the block
@@ -1468,6 +1467,12 @@ __xfs_get_blocks(
 		goto out_unlock;
 	}
 
+	if (IS_DAX(inode) && create) {
+		ASSERT(!ISUNWRITTEN(&imap));
+		/* zeroing is not needed at a higher layer */
+		new = 0;
+	}
+
 	/* trim mapping down to size requested */
 	if (direct || size > (1 << inode->i_blkbits))
 		xfs_map_trim_size(inode, iblock, bh_result,
author	Dave Chinner <dchinner@redhat.com>	2015-11-03 12:37:00 +1100
committer	Dave Chinner <david@fromorbit.com>	2015-11-03 12:37:00 +1100
commit	1ca191576fc862b4766f58e41aa362b28a7c1866 (patch)
tree	07b9e420aae07600a11da6bac81477e15194cdfc /fs/xfs/xfs_aops.c
parent	3fbbbea34bac049c0b5938dc065f7d8ee1ef7e67 (diff)