summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-08-09Btrfs: teach backref walking about backrefs with underflowed offset valuesFilipe Manana
When cloning/deduplicating file extents (through the clone and extent_same ioctls) we can get data back references with offset values that are a result of an unsigned integer arithmetic underflow, that is, values that are much larger then they could be otherwise. This is not a problem when decrementing or dropping the back references (happens when we overwrite the extents or punch a hole for example, through __btrfs_drop_extents()), since we compute the same too large offset value, but it is a problem for the backref walking code, used by an incremental send and the ioctls that are used by the btrfs tool "inspect-internal" commands, as it makes it miss the corresponding file extent items because the search key is set for an extent item that starts at an offset matching the exceptionally large offset value of the data back reference. For an incremental send this causes the send ioctl to fail with -EIO. So teach the backref walking code to deal with these cases by setting the search key's offset to 0 if the backref's offset value is larger than LLONG_MAX (the largest possible file offset). This makes sure the backref walking code finds the corresponding file extent items at the expense of scanning more items and leafs in the btree. Fixing the clone/dedup ioctls to not produce such underflowed results would require major changes breaking backward compatibility, updating user space tools, etc. Simple reproducer case for fstests: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -fr $send_files_dir rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner _need_to_be_root send_files_dir=$TEST_DIR/btrfs-test-$seq rm -f $seqres.full rm -fr $send_files_dir mkdir $send_files_dir _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount # Create our test file with a single extent of 64K starting at file # offset 128K. $XFS_IO_PROG -f -c "pwrite -S 0xaa 128K 64K" $SCRATCH_MNT/foo \ | _filter_xfs_io _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \ $SCRATCH_MNT/mysnap1 # Now clone parts of the original extent into lower offsets of the file. # # The first clone operation adds a file extent item to file offset 0 # that points to our initial extent with a data offset of 16K. The # corresponding data back reference in the extent tree has an offset of # 18446744073709535232, which is the result of file_offset - data_offset # = 0 - 16K. # # The second clone operation adds a file extent item to file offset 16K # that points to our initial extent with a data offset of 48K. The # corresponding data back reference in the extent tree has an offset of # 18446744073709518848, which is the result of file_offset - data_offset # = 16K - 48K. # # Those large back reference offsets (result of unsigned arithmetic # underflow) confused the back reference walking code (used by an # incremental send and the multiple inspect-internal ioctls) and made it # miss the back references, which for the case of an incremental send it # made it fail with -EIO and print a message like the following to # dmesg: # # "BTRFS error (device sdc): did not find backref in send_root. \ # inode=257, offset=0, disk_byte=12845056 found extent=12845056" # $CLONER_PROG -s $(((128 + 16) * 1024)) -d 0 -l $((16 * 1024)) \ $SCRATCH_MNT/foo $SCRATCH_MNT/foo $CLONER_PROG -s $(((128 + 48) * 1024)) -d $((16 * 1024)) \ -l $((16 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \ $SCRATCH_MNT/mysnap2 _run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap _run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \ -f $send_files_dir/2.snap echo "File digest in the original filesystem:" md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch # Now recreate the filesystem by receiving both send streams and verify # we get the same file contents that the original filesystem had. _scratch_unmount _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/2.snap echo "File digest in the new filesystem:" md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch status=0 exit The test's expected golden output is: wrote 65536/65536 bytes at offset 131072 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File digest in the original filesystem: 6c6079335cff141b8a31233ead04cbff SCRATCH_MNT/mysnap2/foo File digest in the new filesystem: 6c6079335cff141b8a31233ead04cbff SCRATCH_MNT/mysnap2/foo But it failed with: (...) @@ -1,7 +1,5 @@ QA output created by 097 wrote 65536/65536 bytes at offset 131072 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) -File digest in the original filesystem: -6c6079335cff141b8a31233ead04cbff SCRATCH_MNT/mysnap2/foo -File digest in the new filesystem: -6c6079335cff141b8a31233ead04cbff SCRATCH_MNT/mysnap2/foo ... $ cat /home/fdmanana/git/hub/xfstests/results//btrfs/097.full (...) ERROR: send ioctl failed with -5: Input/output error Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: fix stale dir entries after unlink, inode eviction and fsyncFilipe Manana
If we remove a hard link from an inode, the inode gets evicted, then we fsync the inode and then power fail/crash, when the log tree is replayed, the parent directory inode still has entries pointing to the name that no longer exists, while our inode no longer has the BTRFS_INODE_REF_KEY item matching the deleted hard link (as expected), leaving the filesystem in an inconsistent state. The stale directory entries can not be deleted (an attempt to delete them causes -ESTALE errors), which makes it impossible to delete the parent directory. This happens because we track the id of the transaction where the last unlink operation for the inode happened (last_unlink_trans) in an in-memory only field of the inode, that is, a value that is never persisted in the inode item stored on the fs/subvol btree. So if an inode is evicted and loaded again, the value for last_unlink_trans is set to 0, which prevents the fsync from logging the parent directory at btrfs_log_inode_parent(). So fix this by setting last_unlink_trans to the id of the transaction that last modified the inode when we load the inode. This is a pessimistic approach but it always ensures correctness with the trade off of ocassional full transaction commits when an fsync is done against the inode in the same transaction where it was evicted and reloaded when our inode is a directory and often logging its parent unnecessarily when our inode is not a directory. The following test case for fstests triggers the problem: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs generic _supported_os Linux _require_scratch _require_dm_flakey _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create our test file with 2 hard links. mkdir $SCRATCH_MNT/testdir touch $SCRATCH_MNT/testdir/foo ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar # Make sure everything done so far is durably persisted. sync # Now remove one of the links, trigger inode eviction and then fsync # our inode. unlink $SCRATCH_MNT/testdir/bar echo 2 > /proc/sys/vm/drop_caches $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo # Silently drop all writes on our scratch device to simulate a power failure. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again and mount the fs to trigger log/journal replay. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Now verify our directory entries. echo "Entries in testdir:" ls -1 $SCRATCH_MNT/testdir # If we remove our inode, its parent should become empty and therefore we should # be able to remove the parent. rm -f $SCRATCH_MNT/testdir/* rmdir $SCRATCH_MNT/testdir _unmount_flakey # The fstests framework will call fsck against our filesystem which will verify # that all metadata is in a consistent state. status=0 exit The test failed on btrfs with: generic/098 4s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad) --- tests/generic/098.out 2015-07-23 18:01:12.616175932 +0100 +++ /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad 2015-07-23 18:04:58.924138308 +0100 @@ -1,3 +1,6 @@ QA output created by 098 Entries in testdir: +bar foo +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo': Stale file handle +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty ... (Run 'diff -u tests/generic/098.out /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad' to see the entire diff) _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent (see /home/fdmanana/git/hub/xfstests/results//generic/098.full) $ cat /home/fdmanana/git/hub/xfstests/results//generic/098.full (...) checking fs roots root 5 inode 258 errors 2001, no inode item, link count wrong unresolved ref dir 257 index 0 namelen 3 name foo filetype 1 errors 6, no dir index, no inode ref unresolved ref dir 257 index 3 namelen 3 name bar filetype 1 errors 5, no dir item, no inode ref Checking filesystem on /dev/sdc (...) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: fix stale directory entries after fsync log replayFilipe Manana
We have another case where after an fsync log replay we get an inode with a wrong link count (smaller than it should be) and a number of directory entries greater than its link count. This happens when we add a new link hard link to our inode A and then we fsync some other inode B that has the side effect of logging the parent directory inode too. In this case at log replay time we add the new hard link to our inode (the item with key BTRFS_INODE_REF_KEY) when processing the parent directory but we never adjust the link count of our inode A. As a result we get stale dir entries for our inode A that can never be deleted and therefore it makes it impossible to remove the parent directory (as its i_size can never decrease back to 0). A simple reproducer for fstests that triggers this issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs generic _supported_os Linux _require_scratch _require_dm_flakey _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create our test directory and files. mkdir $SCRATCH_MNT/testdir touch $SCRATCH_MNT/testdir/foo touch $SCRATCH_MNT/testdir/bar # Make sure everything done so far is durably persisted. sync # Create one hard link for file foo and another one for file bar. After # that fsync only the file bar. ln $SCRATCH_MNT/testdir/bar $SCRATCH_MNT/testdir/bar_link ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/foo_link $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/bar # Silently drop all writes on scratch device to simulate power failure. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again and mount the fs to trigger log/journal replay. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Now verify both our files have a link count of 2. echo "Link count for file foo: $(stat --format=%h $SCRATCH_MNT/testdir/foo)" echo "Link count for file bar: $(stat --format=%h $SCRATCH_MNT/testdir/bar)" # We should be able to remove all the links of our files in testdir, and # after that the parent directory should become empty and therefore # possible to remove it. rm -f $SCRATCH_MNT/testdir/* rmdir $SCRATCH_MNT/testdir _unmount_flakey # The fstests framework will call fsck against our filesystem which will verify # that all metadata is in a consistent state. status=0 exit The test fails with: -Link count for file foo: 2 +Link count for file foo: 1 Link count for file bar: 2 +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo_link': Stale file handle +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty (...) _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent And fsck's output: (...) checking fs roots root 5 inode 258 errors 2001, no inode item, link count wrong unresolved ref dir 257 index 5 namelen 8 name foo_link filetype 1 errors 4, no inode ref Checking filesystem on /dev/sdc (...) So fix this by marking inodes for link count fixup at log replay time whenever a directory entry is replayed if the entry was created in the transaction where the fsync was made and if it points to a non-directory inode. This isn't a new problem/regression, the issue exists for a long time, possibly since the log tree feature was added (2008). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Merge branch 'for-linus-4.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "We have a btrfs quota regression fix. I merged this one on Thursday and have run it through tests against current master. Normally I wouldn't have sent this while you were finalizing rc6, but I'm feeding mosquitoes in the adirondacks next week, so I wanted to get this one out before leaving. I'll leave longer tests running and check on things during the week, but I don't expect any problems" * 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: qgroup: Fix a regression in qgroup reserved space.
2015-08-07treewide: fix typos in comment blocksMasahiro Yamada
Looks like the word "contiguous" is often mistyped. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07Btrfs: Spelling s/consitent/consistent/Geert Uytterhoeven
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07ntfs: super.c: Fix error logNik Nyby
"transation" should be "transaction" Signed-off-by: Nik Nyby <nikolas@gnu.org> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07freevxfs: Grammar s/an negative/a negative/Geert Uytterhoeven
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07treewide: Fix typo in printkMasanari Iida
This patch fix spelling typo inv various part of sources. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07ipc: use private shmem or hugetlbfs inodes for shm segments.Stephen Smalley
The shm implementation internally uses shmem or hugetlbfs inodes for shm segments. As these inodes are never directly exposed to userspace and only accessed through the shm operations which are already hooked by security modules, mark the inodes with the S_PRIVATE flag so that inode security initialization and permission checking is skipped. This was motivated by the following lockdep warning: ====================================================== [ INFO: possible circular locking dependency detected ] 4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W ------------------------------------------------------- httpd/1597 is trying to acquire lock: (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130 but task is already holding lock: (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&mm->mmap_sem){++++++}: lock_acquire+0xc7/0x270 __might_fault+0x7a/0xa0 filldir+0x9e/0x130 xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs] xfs_readdir+0x1b4/0x330 [xfs] xfs_file_readdir+0x2b/0x30 [xfs] iterate_dir+0x97/0x130 SyS_getdents+0x91/0x120 entry_SYSCALL_64_fastpath+0x12/0x76 -> #2 (&xfs_dir_ilock_class){++++.+}: lock_acquire+0xc7/0x270 down_read_nested+0x57/0xa0 xfs_ilock+0x167/0x350 [xfs] xfs_ilock_attr_map_shared+0x38/0x50 [xfs] xfs_attr_get+0xbd/0x190 [xfs] xfs_xattr_get+0x3d/0x70 [xfs] generic_getxattr+0x4f/0x70 inode_doinit_with_dentry+0x162/0x670 sb_finish_set_opts+0xd9/0x230 selinux_set_mnt_opts+0x35c/0x660 superblock_doinit+0x77/0xf0 delayed_superblock_init+0x10/0x20 iterate_supers+0xb3/0x110 selinux_complete_init+0x2f/0x40 security_load_policy+0x103/0x600 sel_write_load+0xc1/0x750 __vfs_write+0x37/0x100 vfs_write+0xa9/0x1a0 SyS_write+0x58/0xd0 entry_SYSCALL_64_fastpath+0x12/0x76 ... Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Reported-by: Morten Stevens <mstevens@fedoraproject.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Paul Moore <paul@paul-moore.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07ocfs2: fix shift left overflowJoseph Qi
When using a large volume, for example 9T volume with 2T already used, frequent creation of small files with O_DIRECT when the IO is not cluster aligned may clear sectors in the wrong place. This will cause filesystem corruption. This is because p_cpos is a u32. When calculating the corresponding sector it should be converted to u64 first, otherwise it may overflow. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()Jan Kara
fsnotify_clear_marks_by_group_flags() can race with fsnotify_destroy_marks() so that when fsnotify_destroy_mark_locked() drops mark_mutex, a mark from the list iterated by fsnotify_clear_marks_by_group_flags() can be freed and thus the next entry pointer we have cached may become stale and we dereference free memory. Fix the problem by first moving marks to free to a special private list and then always free the first entry in the special list. This method is safe even when entries from the list can disappear once we drop the lock. Signed-off-by: Jan Kara <jack@suse.com> Reported-by: Ashish Sangwan <a.sangwan@samsung.com> Reviewed-by: Ashish Sangwan <a.sangwan@samsung.com> Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07signalfd: fix information leak in signalfd_copyinfoAmanieu d'Antras
This function may copy the si_addr_lsb field to user mode when it hasn't been initialized, which can leak kernel stack data to user mode. Just checking the value of si_code is insufficient because the same si_code value is shared between multiple signals. This is solved by checking the value of si_signo in addition to si_code. Signed-off-by: Amanieu d'Antras <amanieu@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07ocfs2: fix BUG in ocfs2_downconvert_thread_do_work()Joseph Qi
The "BUG_ON(list_empty(&osb->blocked_lock_list))" in ocfs2_downconvert_thread_do_work can be triggered in the following case: ocfs2dc has firstly saved osb->blocked_lock_count to local varibale processed, and then processes the dentry lockres. During the dentry put, it calls iput and then deletes rw, inode and open lockres from blocked list in ocfs2_mark_lockres_freeing. And this causes the variable `processed' to not reflect the number of blocked lockres to be processed, which triggers the BUG. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07fs, file table: reinit files_stat.max_files after deferred memory initialisationMel Gorman
Dave Hansen reported the following; My laptop has been behaving strangely with 4.2-rc2. Once I log in to my X session, I start getting all kinds of strange errors from applications and see this in my dmesg: VFS: file-max limit 8192 reached The problem is that the file-max is calculated before memory is fully initialised and miscalculates how much memory the kernel is using. This patch recalculates file-max after deferred memory initialisation. Note that using memory hotplug infrastructure would not have avoided this problem as the value is not recalculated after memory hot-add. 4.1: files_stat.max_files = 6582781 4.2-rc2: files_stat.max_files = 8192 4.2-rc2 patched: files_stat.max_files = 6562467 Small differences with the patch applied and 4.1 but not enough to matter. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Dave Hansen <dave.hansen@intel.com> Cc: Nicolai Stange <nicstange@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Alex Ng <alexng@microsoft.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-06btrfs: qgroup: Fix a regression in qgroup reserved space.Qu Wenruo
During the change to new btrfs extent-oriented qgroup implement, due to it doesn't use the old __qgroup_excl_accounting() for exclusive extent, it didn't free the reserved bytes. The bug will cause limit function go crazy as the reserved space is never freed, increasing limit will have no effect and still cause EQOUT. The fix is easy, just free reserved bytes for newly created exclusive extent as what it does before. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Yang Dongsheng <yangds.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-05f2fs: recover invalid/reserved block address for fsynced fileChao Yu
When testing with generic/101 in xfstests, error message outputed as below: --- tests/generic/101.out +++ results//generic/101.out.bad @@ -10,10 +10,14 @@ File foo content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * -0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 +0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb * 0372000 ... (Run 'diff -u tests/generic/101.out results/generic/101.out.bad' to see the entire diff) The test flow is like below: 1. pwrite foo -S 0xaa 0 64K 2. pwrite foo -S 0xbb 64K 61K 3. sync 4. truncate foo 64K 5. truncate foo 125K 6. fsync foo 7. flakey drop writes 8. umount After this test, we expect the data of recovered file will have the first 64k of data filling with value 0xaa and the next 61k of data filling with value 0x00 because we have fsynced it before dropping writes in dm. In f2fs, during recovering, we will only recover the valid block address in direct node page if it is marked as a fsynced dnode, but block address which means invalid/reserved (with value NULL_ADDR/NEW_ADDR) will not be recovered. So, the file recovered shows its incorrect data 0xbb in range of [61k, 125k]. In this patch, we fix to recover invalid/reserved block during recover flow. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: use extent cache to optimize f2fs_reserve_blockFan Li
In some cases, we only need the block address when we call f2fs_reserve_block, other fields of struct dnode_of_data aren't necessary. We can try extent cache first for such cases in order to speed up the process. Signed-off-by: Fan li <fanofcode.li@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05fs/char_dev.c: fix incorrect documentation for unregister_chrdev_regionPartha Pratim Mukherjee
The current documentation for unregister_chrdev_region says that it return a range of device numbers which is incorrect. Instead it unregister a range of device numbers. Fix the documentation to make this clear. Signed-off-by: Partha Pratim Mukherjee <ppm.floss@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-05xprtrdma: Fix large NFS SYMLINK callsChuck Lever
Repair how rpcrdma_marshal_req() chooses which RDMA message type to use for large non-WRITE operations so that it picks RDMA_NOMSG in the correct situations, and sets up the marshaling logic to SEND only the RPC/RDMA header. Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS server XDR decoder for NFSv2 SYMLINK does not handle having the pathname argument arrive in a separate buffer. The decoder could be fixed, but this is simpler and RDMA_NOMSG can be used in a variety of other situations. Ensure that the Linux client continues to use "RDMA_MSG + read list" when sending large NFSv3 SYMLINK requests, which is more efficient than using RDMA_NOMSG. Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG + read list" just like NFSv3 (see Section 5 of RFC 5667). Before, these did not work at all. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05char: make misc_deregister a void functionGreg Kroah-Hartman
With well over 200+ users of this api, there are a mere 12 users that actually checked the return value of this function. And all of them really didn't do anything with that information as the system or module was shutting down no matter what. So stop pretending like it matters, and just return void from misc_deregister(). If something goes wrong in the call, you will get a WARNING splat in the syslog so you know how to fix up your driver. Other than that, there's nothing that can go wrong. Cc: Alasdair Kergon <agk@redhat.com> Cc: Neil Brown <neilb@suse.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Wim Van Sebroeck <wim@iguana.be> Cc: Christine Caulfield <ccaulfie@redhat.com> Cc: David Teigland <teigland@redhat.com> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <jlbec@evilplan.org> Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Acked-by: Alessandro Zummo <a.zummo@towertech.it> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-05f2fs: invalidate temporary meta pageChao Yu
To avoid meeting garbage data in next free node block at the end of warm node chain when doing recovery, we will try to zero out that invalid block. If the device is not support discard, our way for zeroing out block is: grabbing a temporary zeroed page in meta inode, then, issue write request with this page. But, we forget to release that temporary page, so our memory usage will increase without gaining any hit ratio benefit, so it's better to free it for saving memory. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: fix to release inode page correctlyChao Yu
In following call path, we will pass a locked and referenced ipage pointer to get_new_data_page: - init_inode_metadata - make_empty_dir - get_new_data_page There are two exit paths in get_new_data_page when error occurs: 1) grab_cache_page fails, ipage will not be released; 2) f2fs_reserve_block fails, ipage will be released in callee. So, it's not consistent for error handling in get_new_data_page. For f2fs_reserve_block, it's not very easy to change the rule of error handling, since it's already complicated. Here we deside to choose an easy way to fix this issue: If any error occur in get_new_data_page, we will ensure releasing ipage in this function. The same issue is in f2fs_convert_inline_dir, fix that too. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: unify f2fs_bug_on when check blocks and segmentLiu Xue
Replace BUG_ON with f2fs_bug_on to deal with block and segment validity check failed. Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: freeze filesystem when fail to update meta page due to IO errorChao Yu
In get_meta_page, we guarantee no failure for the returned page, but sometimes, IO error from device will incur returning an non-updated page. Then, we still use this page as updated one, exception could happen when using this kind of page. So in this condition, we'd better freeze fs by making fs readonly and and stop doing checkpoint. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: change the timing of f2fs_wait_on_page_writebackFan Li
some backing devices need pages to be stable during writeback. It doesn't matter if the page is completely overwritten or already uptodate, it needs to wait before write. Signed-off-by: Fan li <fanofcode.li@samsung.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: handle error cases in commit_inmem_pagesJaegeuk Kim
This patch adds to handle error cases in commit_inmem_pages. If an error occurs, it stops to write the pages and return the error right away. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: fix to build free nids from readaheaded nat pagesChao Yu
When there is no enough free nids in free nid cache, we will try to readahead FREE_NID_PAGES:4 nat pages into page cache of meta_inode, then, reading nat entries in nat page for adding free nids to free nid cache. But when traversing all nat pages we readaheaded in a circulation, our exit condition is not set right, one more nat page will be scanned without readaheading, resulting worse read performance. This patch fixes to read the correct number nat pages to avoid bad performance. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: fix inline data/dentry stat number leakChao Yu
If we clear inline data/dentry flag in handle_failed_inode, we will fail to decline the stat count of inline data/dentry in f2fs_evict_inode due to no flag in inode. So remove the wrong clearing. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: convert inline data before set atomic/volatile flagChao Yu
In f2fs_ioc_start_{atomic,volatile}_write, if we failed in converting inline data, we will report error to user, but still remain atomic/volatile flag in inode, it will impact further writes for this file. Fix it. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: fix to wait all atomic written pages writebackChao Yu
This patch fixes the incorrect range (0, LONG_MAX) which is used in ranged fsync. If we use LONG_MAX as the parameter for indicating the end of file we want to synchronize, in 32-bits architecture machine, these datas after 4GB offset may not be persisted in storage after ->fsync returned. Here, we alter LONG_MAX to LLONG_MAX to fix this issue. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: skip writing in ->writepages when no dirty pages existChao Yu
When flushing comes from background, if there is no dirty page in the mapping of inode, we'd better to skip seeking dirty page from mapping for writebacking. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: optimize f2fs_write_cache_pagesTiezhu Yang
The if statement "goto continue_unlock" is exactly the same when each if condition is true that is depended on the value of both "step" and "is_cold_data(page)" are 0 or 1. That means when the value of "step" equals to "is_cold_data(page)", the if condition is true and the if statement "goto continue_unlock" appears only once, so it can be optimized to reduce the duplicated code. Signed-off-by: Tiezhu Yang <kernelpatch@126.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: fix double lock in handle_failed_inodeChao Yu
In handle_failed_inode, there is a potential deadlock which can happen in below call path: - f2fs_create - f2fs_lock_op down_read(cp_rwsem) - f2fs_add_link - __f2fs_add_link - init_inode_metadata - f2fs_init_security failed - truncate_blocks failed - handle_failed_inode - f2fs_truncate - truncate_blocks(..,true) - write_checkpoint - block_operations - f2fs_lock_all down_write(cp_rwsem) - f2fs_lock_op down_read(cp_rwsem) So in this path, we pass parameter to f2fs_truncate to make sure cp_rwsem in truncate_blocks will not be locked again. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: reduce region of cp_rwsem covered in f2fs_do_collapseChao Yu
In f2fs_do_collapse, region cp_rwsem covered is large, since it will be held until all blocks are left shifted, so if we try to collapse small area at the beginning of large file, checkpoint who want to grab writer's lock of cp_rwsem will be delayed for long time. In order to avoid this condition, altering to lock/unlock cp_rwsem each shift operation. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: add new interfaces for extent treeFan Li
Add a lookup and a insertion interface for extent tree. The new lookup return the insert position and the prev/next extents closest to the offset we lookup when find no match. The new insertion uses above parameters to improve performance. There are three possible insertions after the lookup in f2fs_update_extent_tree, two of them insert parts of removed extent back to tree, since no merge happens during this process, new insertion skips the merge check in this scanario; the another insertion inserts a new extent to tree, new insertion uses prev/next extent and insert position to insert this extent directly, and save the time of searching down the tree. As long as tree remains unchanged between lookup and insertion, this would work fine. And the new lookup would be useful when add multi-blocks extent support for insertion interface. Signed-off-by: Fan li <fanofcode.li@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: callers take care of the page from bio errorJaegeuk Kim
This patch changes for a caller to handle the page after its bio gets an error. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: use atomic_t to record hit ratio info of extent cacheChao Yu
Variables for recording extent cache ratio info were updated without protection, this patch tries to alter them to atomic_t type for more accurate stat. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: stat inline xattr inode numberChao Yu
This patch adds to stat the number of inline xattr inode for showing in debugfs. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05f2fs: use a page temporarily for encrypted gced pageJaegeuk Kim
That encrypted page is used temporarily, so we don't need to mark it accessed. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05Merge branch 'for-4.2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd fixes from Bruce Fields. * 'for-4.2' of git://linux-nfs.org/~bfields/linux: nfsd: do nfs4_check_fh in nfs4_check_file instead of nfs4_check_olstateid nfsd: Fix a file leak on nfsd4_layout_setlease failure nfsd: Drop BUG_ON and ignore SECLABEL on absent filesystem
2015-08-04may_follow_link() should use nd->inodeAl Viro
Now that we can get there in RCU mode, we shouldn't play with nd->path.dentry->d_inode - it's not guaranteed to be stable. Use nd->inode instead. Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-08-04f2fs: expose f2fs_write_cache_pagesChao Yu
If there are gced dirty pages and normal dirty pages in the mapping of one inode, we might writeback them alternately with discontinuous block address, resulting in low performance. This patch introduces f2fs_write_cache_pages with codes copied from write_cache_pages in mm/page-writeback.c. In this function, we refactor flow with two steps: 1) writeback all cold type pages. 2) writeback all non-cold type pages. By using this method, f2fs will writeback dirty pages with the same temperature in bunch mode, it makes writeouted block being with more continuous address, so they can be merged as much as possible in f2fs bio cache, and also it will reduce the chance of submiting small IO from block layer. Test environment: 8g nokia sd card (very old sd card, but it shows better effect when testing with this patch, and with a 32g kingston sd card, I didn't see much more improvement). Test step: 1. touch testfile; 2. truncate -s 512K testfile; 3. write all pages with odd index; 4. trigger gc by ioctl; 5. write all pages with even index; 6. time fsync testfile. before: real 0m0.402s user 0m0.000s sys 0m0.000s after: real 0m0.143s user 0m0.004s sys 0m0.004s Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04f2fs: correct return value of ->setxattrChao Yu
This patch fixes to return correct error number of ->setxattr, which is reported by xfstest tests/generic/026 as below: generic/026 - output mismatch --- tests/generic/026.out +++ results/generic/026.out.bad @@ -4,6 +4,6 @@ 1 below acl max acl max 1 above acl max -chacl: cannot set access acl on "largeaclfile": Argument list too long +chacl: cannot set access acl on "largeaclfile": Numerical result out of range use 16 aces use 17 aces ... Ran: generic/026 Failures: generic/026 Failed 1 of 1 tests Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04f2fs: cleanup write_orphan_inodesChao Yu
Previously, since 'commit 4531929e3922 ("f2fs: move grabing orphan pages out of protection region")' was committed, in write_orphan_inodes(), we will grab all meta page in a batch before we use them under spinlock, so that we can avoid large time delay of grabbing meta pages under spinlock. Now, 'commit d6c67a4fee86 ("f2fs: revmove spin_lock for write_orphan_inodes")' remove the spinlock in write_orphan_inodes, so there is no issue we describe above, we'd better recover to move the grab operation to original place for readability. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04f2fs: warm up cold page after mmaped writeChao Yu
With cost-benifit method, background gc will consider old section with fewer valid blocks as candidate victim, these old blocks in section will be treated as cold data, and laterly will be moved into cold segment. But if the gcing page is attached by user through buffered or mmaped write, we should reset the page as non-cold one, because this page may have more opportunity for further updating. So fix to add clearing code for the missed 'mmap' case. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04f2fs: add new ioctl F2FS_IOC_GARBAGE_COLLECTChao Yu
When background gc is off, the only way to trigger gc is executing a force gc in some operations who wants to grab space in disk. The executing condition is limited: to execute force gc, we should wait for the time when there is almost no more free section for LFS allocation. This seems not reasonable for our user who wants to control triggering gc by himself. This patch introduces F2FS_IOC_GARBAGE_COLLECT interface for triggering garbage collection by using ioctl. It provides our users one more option to trigger gc. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04f2fs: maintain extent cache in separated fileChao Yu
This patch moves extent cache related code from data.c into extent_cache.c since extent cache is independent feature, and its codes are not relate to others in data.c, it's better for us to maintain them in separated place. There is no functionality change, but several small coding style fixes including: * rename __drop_largest_extent to f2fs_drop_largest_extent for exporting; * rename misspelled word 'untill' to 'until'; * remove unneeded 'return' in the end of f2fs_destroy_extent_tree(). Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04f2fs: don't try to split extents shorter than F2FS_MIN_EXTENT_LENFan Li
Since only parts of extents longer than F2FS_MIN_EXTENT_LEN will be kept in extent cache after split, extents already shorter than F2FS_MIN_EXTENT_LEN don't need to try split at all. Signed-off-by: Fan Li <fanofcode.li@samsung.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04f2fs: fix to update page flagChao Yu
This patch fixes to update page flag (e.g. Uptodate/cold flag) in ->write_begin. Otherwise, page will be non-uptodate when we try to write entire page, and cold data flag in page will not be clean when gced page is being rewritten. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>