summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)Author
2015-04-12ext4 crypto: filename encryption facilitiesMichael Halcrow
Signed-off-by: Uday Savagaonkar <savagaon@google.com> Signed-off-by: Ildar Muslukhov <ildarm@google.com> Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12ext4 crypto: implement the ext4 decryption read pathMichael Halcrow
Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Ildar Muslukhov <ildarm@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12ext4 crypto: implement the ext4 encryption write pathMichael Halcrow
Pulls block_write_begin() into fs/ext4/inode.c because it might need to do a low-level read of the existing data, in which case we need to decrypt it. Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Ildar Muslukhov <ildarm@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12ext4 crypto: inherit encryption policies on inode and directory createMichael Halcrow
Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12ext4 crypto: enforce context consistencyTheodore Ts'o
Enforce the following inheritance policy: 1) An unencrypted directory may contain encrypted or unencrypted files or directories. 2) All files or directories in a directory must be protected using the same key as their containing directory. As a result, assuming the following setup: mke2fs -t ext4 -Fq -O encrypt /dev/vdc mount -t ext4 /dev/vdc /vdc mkdir /vdc/a /vdc/b /vdc/c echo foo | e4crypt add_key /vdc/a echo bar | e4crypt add_key /vdc/b for i in a b c ; do cp /etc/motd /vdc/$i/motd-$i ; done Then we will see the following results: cd /vdc mv a b # will fail; /vdc/a and /vdc/b have different keys mv b/motd-b a # will fail, see above ln a/motd-a b # will fail, see above mv c a # will fail; all inodes in an encrypted directory # must be encrypted ln c/motd-c b # will fail, see above mv a/motd-a c # will succeed mv c/motd-a a # will succeed Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12ext4 crypto: add encryption key management facilitiesMichael Halcrow
Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Ildar Muslukhov <muslukhovi@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12ext4 crypto: add ext4 encryption facilitiesMichael Halcrow
On encrypt, we will re-assign the buffer_heads to point to a bounce page rather than the control_page (which is the original page to write that contains the plaintext). The block I/O occurs against the bounce page. On write completion, we re-assign the buffer_heads to the original plaintext page. On decrypt, we will attach a read completion callback to the bio struct. This read completion will decrypt the read contents in-place prior to setting the page up-to-date. The current encryption mode, AES-256-XTS, lacks cryptographic integrity. AES-256-GCM is in-plan, but we will need to devise a mechanism for handling the integrity data. Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Ildar Muslukhov <ildarm@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-11mirror O_APPEND and O_DIRECT into iocb->ki_flagsAl Viro
... avoiding write_iter/fcntl races. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11switch generic_write_checks() to iocb and iterAl Viro
... returning -E... upon error and amount of data left in iter after (possible) truncation upon success. Note, that normal case gives a non-zero (positive) return value, so any tests for != 0 _must_ be updated. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Conflicts: fs/ext4/file.c
2015-04-11ext4_file_write_iter: move generic_write_checks() upAl Viro
simpler that way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11generic_write_checks(): drop isblk argumentAl Viro
all remaining callers are passing 0; some just obscure that fact. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11lift generic_write_checks() into callers of __generic_file_write_iter()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11direct_IO: remove rw from a_ops->direct_IO()Omar Sandoval
Now that no one is using rw, remove it completely. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11direct_IO: use iov_iter_rw() instead of rw everywhereOmar Sandoval
The rw parameter to direct_IO is redundant with iov_iter->type, and treated slightly differently just about everywhere it's used: some users do rw & WRITE, and others do rw == WRITE where they should be doing a bitwise check. Simplify this with the new iov_iter_rw() helper, which always returns either READ or WRITE. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11Remove rw from dax_{do_,}io()Omar Sandoval
And use iov_iter_rw() instead. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11Remove rw from {,__,do_}blockdev_direct_IO()Omar Sandoval
Most filesystems call through to these at some point, so we'll start here. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11make new_sync_{read,write}() staticAl Viro
All places outside of core VFS that checked ->read and ->write for being NULL or called the methods directly are gone now, so NULL {read,write} with non-NULL {read,write}_iter will do the right thing in all cases. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11Merge branch 'iocb' into for-nextAl Viro
2015-04-11ext4 crypto: add encryption policy and password salt supportMichael Halcrow
Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Ildar Muslukhov <muslukhovi@gmail.com>
2015-04-11ext4 crypto: add encryption xattr supportMichael Halcrow
Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-11ext4 crypto: export ext4_empty_dir()Michael Halcrow
Required for future encryption xattr changes. Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-11ext4 crypto: add ext4 encryption KconfigTheodore Ts'o
Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-11ext4 crypto: reserve codepoints used by the ext4 encryption featureTheodore Ts'o
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-08ext4 crypto: add ext4_mpage_readpages()Theodore Ts'o
This takes code from fs/mpage.c and optimizes it for ext4. Its primary reason is to allow us to more easily add encryption to ext4's read path in an efficient manner. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-03ext4: make fsync to sync parent dir in no-journal for real this timeLukas Czerner
Previously commit 14ece1028b3ed53ffec1b1213ffc6acaf79ad77c added a support for for syncing parent directory of newly created inodes to make sure that the inode is not lost after a power failure in no-journal mode. However this does not work in majority of cases, namely: - if the directory has inline data - if the directory is already indexed - if the directory already has at least one block and: - the new entry fits into it - or we've successfully converted it to indexed So in those cases we might lose the inode entirely even after fsync in the no-journal mode. This also includes ext2 default mode obviously. I've noticed this while running xfstest generic/321 and even though the test should fail (we need to run fsck after a crash in no-journal mode) I could not find a newly created entries even when if it was fsynced before. Fix this by adjusting the ext4_add_entry() successful exit paths to set the inode EXT4_STATE_NEWENTRY so that fsync has the chance to fsync the parent directory as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Frank Mayhar <fmayhar@google.com> Cc: stable@vger.kernel.org
2015-04-03ext4: don't release reserved space for previously allocated clusterEric Whitney
When xfstests' auto group is run on a bigalloc filesystem with a 4.0-rc3 kernel, e2fsck failures and kernel warnings occur for some tests. e2fsck reports incorrect iblocks values, and the warnings indicate that the space reserved for delayed allocation is being overdrawn at allocation time. Some of these errors occur because the reserved space is incorrectly decreased by one cluster when ext4_ext_map_blocks satisfies an allocation request by mapping an unused portion of a previously allocated cluster. Because a cluster's worth of reserved space was already released when it was first allocated, it should not be released again. This patch appears to correct the e2fsck failure reported for generic/232 and the kernel warnings produced by ext4/001, generic/009, and generic/033. Failures and warnings for some other tests remain to be addressed. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-03ext4: fix loss of delalloc extent info in ext4_zero_range()Eric Whitney
In ext4_zero_range(), removing a file's entire block range from the extent status tree removes all records of that file's delalloc extents. The delalloc accounting code uses this information, and its loss can then lead to accounting errors and kernel warnings at writeback time and subsequent file system damage. This is most noticeable on bigalloc file systems where code in ext4_ext_map_blocks() handles cases where delalloc extents share clusters with a newly allocated extent. Because we're not deleting a block range and are correctly updating the status of its associated extent, there is no need to remove anything from the extent status tree. When this patch is combined with an unrelated bug fix for ext4_zero_range(), kernel warnings and e2fsck errors reported during xfstests runs on bigalloc filesystems are greatly reduced without introducing regressions on other xfstests-bld test scenarios. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-03ext4: allocate entire range in zero rangeLukas Czerner
Currently there is a bug in zero range code which causes zero range calls to only allocate block aligned portion of the range, while ignoring the rest in some cases. In some cases, namely if the end of the range is past i_size, we do attempt to preallocate the last nonaligned block. However this might cause kernel to BUG() in some carefully designed zero range requests on setups where page size > block size. Fix this problem by first preallocating the entire range, including the nonaligned edges and converting the written extents to unwritten in the next step. This approach will also give us the advantage of having the range to be as linearly contiguous as possible. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-03ext4: remove unnecessary lock/unlock of i_block_reservation_lockMaurizio Lombardi
This is a leftover of commit 71d4f7d032149b935a26eb3ff85c6c837f3714e1 Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2015-04-02ext4: remove block_device_ejectedChristoph Hellwig
bdi->dev now never goes away, so this function became useless. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-02ext4: remove useless condition in if statement.Wei Yuan
In this if statement, the previous condition is useless, the later one has covered it. Signed-off-by: Weiyuan <weiyuan.wei@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2015-04-02ext4: remove unused header filesSheng Yong
Remove unused header files and header files which are included in ext4.h. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-02ext4: fix comments in ext4_can_extents_be_merged()Xiaoguang Wang
Since commit a9b8241594add, we are allowed to merge unwritten extents, so here these comments are wrong, remove it. Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-02ext4: fix transposition typo in format stringRasmus Villemoes
According to C99, %*.s means the same as %*.0s, in other words, print as many spaces as the field width argument says and effectively ignore the string argument. That is certainly not what was meant here. The kernel's printf implementation, however, treats it as if the . was not there, i.e. as %*s. I don't know if de->name is nul-terminated or not, but in any case I'm guessing the intention was to use de->name_len as precision instead of field width. [ Note: this is debugging code which is commented out, so this is not security issue; a developer would have to explicitly enable INLINE_DIR_DEBUG before this would be an issue. ] Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-02ext4: fix bh leak on error paths in ext4_rename() and ext4_cross_rename()Konstantin Khlebnikov
Release references to buffer-heads if ext4_journal_start() fails. Fixes: 5b61de757535 ("ext4: start handle at least possible moment when renaming files") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2015-03-25fs: move struct kiocb to fs.hChristoph Hellwig
struct kiocb now is a generic I/O container, so move it to fs.h. Also do a #include diet for aio.h while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-04quota: Make VFS quotas use new interface for getting quota infoJan Kara
Create new internal interface for getting information about quota which contains everything needed for both VFS quotas and XFS quotas. Make VFS use this and hook it up to Q_GETINFO. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2015-02-22Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Ext4 bug fixes. We also reserved code points for encryption and read-only images (for which the implementation is mostly just the reserved code point for a read-only feature :-)" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix indirect punch hole corruption ext4: ignore journal checksum on remount; don't fail ext4: remove duplicate remount check for JOURNAL_CHECKSUM change ext4: fix mmap data corruption in nodelalloc mode when blocksize < pagesize ext4: support read-only images ext4: change to use setup_timer() instead of init_timer() ext4: reserve codepoints used by the ext4 encryption feature jbd2: complain about descriptor block checksum errors
2015-02-17Merge branch 'lazytime' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull lazytime mount option support from Al Viro: "Lazytime stuff from tytso" * 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ext4: add optimization for the lazytime mount option vfs: add find_inode_nowait() function vfs: add support for a lazytime mount option
2015-02-16ext4: add DAX functionalityRoss Zwisler
This is a port of the DAX functionality found in the current version of ext2. [matthew.r.wilcox@intel.com: heavily tweaked] [akpm@linux-foundation.org: remap_pages went away] Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-14ext4: fix indirect punch hole corruptionOmar Sandoval
Commit 4f579ae7de56 (ext4: fix punch hole on files with indirect mapping) rewrote FALLOC_FL_PUNCH_HOLE for ext4 files with indirect mapping. However, there are bugs in several corner cases. This fixes 5 distinct bugs: 1. When there is at least one entire level of indirection between the start and end of the punch range and the end of the punch range is the first block of its level, we can't return early; we have to free the intervening levels. 2. When the end is at a higher level of indirection than the start and ext4_find_shared returns a top branch for the end, we still need to free the rest of the shared branch it returns; we can't decrement partial2. 3. When a punch happens within one level of indirection, we need to converge on an indirect block that contains the start and end. However, because the branches returned from ext4_find_shared do not necessarily start at the same level (e.g., the partial2 chain will be shallower if the last block occurs at the beginning of an indirect group), the walk of the two chains can end up "missing" each other and freeing a bunch of extra blocks in the process. This mismatch can be handled by first making sure that the chains are at the same level, then walking them together until they converge. 4. When the punch happens within one level of indirection and ext4_find_shared returns a top branch for the start, we must free it, but only if the end does not occur within that branch. 5. When the punch happens within one level of indirection and ext4_find_shared returns a top branch for the end, then we shouldn't free the block referenced by the end of the returned chain (this mirrors the different levels case). Signed-off-by: Omar Sandoval <osandov@osandov.com>
2015-02-12ext4: ignore journal checksum on remount; don't failEric Sandeen
As of v3.18, ext4 started rejecting a remount which changes the journal_checksum option. Prior to that, it was simply ignored; the problem here is that if someone has this in their fstab for the root fs, now the box fails to boot properly, because remount of root with the new options will fail, and the box proceeds with a readonly root. I think it is a little nicer behavior to accept the option, but warn that it's being ignored, rather than failing the mount, but that might be a subjective matter... Reported-by: Cónräd <conradsand.arma@gmail.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-02-12ext4: remove duplicate remount check for JOURNAL_CHECKSUM changeEric Sandeen
rejection of, changing journal_checksum during remount. One suffices. While we're at it, remove old comment about the "check" option which has been deprecated for some time now. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-02-12ext4: fix mmap data corruption in nodelalloc mode when blocksize < pagesizeXiaoguang Wang
Since commit 90a8020 and d6320cb, Jan Kara has fixed this issue partially. This mmap data corruption still exists in nodelalloc mode, fix this. Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2015-02-12ext4: support read-only imagesDarrick J. Wong
Add a rocompat feature, "readonly" to mark a FS image as read-only. The feature prevents the kernel and e2fsprogs from changing the image; the flag can be toggled by tune2fs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-02-12Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull backing device changes from Jens Axboe: "This contains a cleanup of how the backing device is handled, in preparation for a rework of the life time rules. In this part, the most important change is to split the unrelated nommu mmap flags from it, but also removing a backing_dev_info pointer from the address_space (and inode), and a cleanup of other various minor bits. Christoph did all the work here, I just fixed an oops with pages that have a swap backing. Arnd fixed a missing export, and Oleg killed the lustre backing_dev_info from staging. Last patch was from Al, unexporting parts that are now no longer needed outside" * 'for-3.20/bdi' of git://git.kernel.dk/linux-block: Make super_blocks and sb_lock static mtd: export new mtd_mmap_capabilities fs: make inode_to_bdi() handle NULL inode staging/lustre/llite: get rid of backing_dev_info fs: remove default_backing_dev_info fs: don't reassign dirty inodes to default_backing_dev_info nfs: don't call bdi_unregister ceph: remove call to bdi_unregister fs: remove mapping->backing_dev_info fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info nilfs2: set up s_bdi like the generic mount_bdev code block_dev: get bdev inode bdi directly from the block device block_dev: only write bdev inode on close fs: introduce f_op->mmap_capabilities for nommu mmap support fs: kill BDI_CAP_SWAP_BACKED fs: deduplicate noop_backing_dev_info
2015-02-10Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: "Bite-sized chunks this time, to avoid the MTA ratelimiting woes. - fs/notify updates - ocfs2 - some of MM" That laconic "some MM" is mainly the removal of remap_file_pages(), which is a big simplification of the VM, and which gets rid of a *lot* of random cruft and special cases because we no longer support the non-linear mappings that it used. From a user interface perspective, nothing has changed, because the remap_file_pages() syscall still exists, it's just done by emulating the old behavior by creating a lot of individual small mappings instead of one non-linear one. The emulation is slower than the old "native" non-linear mappings, but nobody really uses or cares about remap_file_pages(), and simplifying the VM is a big advantage. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (78 commits) memcg: zap memcg_slab_caches and memcg_slab_mutex memcg: zap memcg_name argument of memcg_create_kmem_cache memcg: zap __memcg_{charge,uncharge}_slab mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check mm: hugetlb: fix type of hugetlb_treat_as_movable variable mm, hugetlb: remove unnecessary lower bound on sysctl handlers"? mm: memory: merge shared-writable dirtying branches in do_wp_page() mm: memory: remove ->vm_file check on shared writable vmas xtensa: drop _PAGE_FILE and pte_file()-related helpers x86: drop _PAGE_FILE and pte_file()-related helpers unicore32: drop pte_file()-related helpers um: drop _PAGE_FILE and pte_file()-related helpers tile: drop pte_file()-related helpers sparc: drop pte_file()-related helpers sh: drop _PAGE_FILE and pte_file()-related helpers score: drop _PAGE_FILE and pte_file()-related helpers s390: drop pte_file()-related helpers parisc: drop _PAGE_FILE and pte_file()-related helpers openrisc: drop _PAGE_FILE and pte_file()-related helpers nios2: drop _PAGE_FILE and pte_file()-related helpers ...
2015-02-10mm: drop vm_ops->remap_pages and generic_file_remap_pages() stubKirill A. Shutemov
Nobody uses it anymore. [akpm@linux-foundation.org: fix filemap_xip.c] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-05ext4: add optimization for the lazytime mount optionTheodore Ts'o
Add an optimization for the MS_LAZYTIME mount option so that we will opportunistically write out any inodes with the I_DIRTY_TIME flag set in a particular inode table block when we need to update some inode in that inode table block anyway. Also add some temporary code so that we can set the lazytime mount option without needing a modified /sbin/mount program which can set MS_LAZYTIME. We can eventually make this go away once util-linux has added support. Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-05vfs: add support for a lazytime mount optionTheodore Ts'o
Add a new mount option which enables a new "lazytime" mode. This mode causes atime, mtime, and ctime updates to only be made to the in-memory version of the inode. The on-disk times will only get updated when (a) if the inode needs to be updated for some non-time related change, (b) if userspace calls fsync(), syncfs() or sync(), or (c) just before an undeleted inode is evicted from memory. This is OK according to POSIX because there are no guarantees after a crash unless userspace explicitly requests via a fsync(2) call. For workloads which feature a large number of random write to a preallocated file, the lazytime mount option significantly reduces writes to the inode table. The repeated 4k writes to a single block will result in undesirable stress on flash devices and SMR disk drives. Even on conventional HDD's, the repeated writes to the inode table block will trigger Adjacent Track Interference (ATI) remediation latencies, which very negatively impact long tail latencies --- which is a very big deal for web serving tiers (for example). Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>