path: root/fs
Age | Commit message | Author
2015-08-07 | treewide: Fix typo in printk | Masanari Iida
This patch fixes spelling typos in various parts of the sources. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07 | ipc: use private shmem or hugetlbfs inodes for shm segments. | Stephen Smalley
The shm implementation internally uses shmem or hugetlbfs inodes for shm segments. As these inodes are never directly exposed to userspace and only accessed through the shm operations which are already hooked by security modules, mark the inodes with the S_PRIVATE flag so that inode security initialization and permission checking is skipped. This was motivated by the following lockdep warning: ====================================================== [ INFO: possible circular locking dependency detected ] 4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W ------------------------------------------------------- httpd/1597 is trying to acquire lock: (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130 but task is already holding lock: (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&mm->mmap_sem){++++++}: lock_acquire+0xc7/0x270 __might_fault+0x7a/0xa0 filldir+0x9e/0x130 xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs] xfs_readdir+0x1b4/0x330 [xfs] xfs_file_readdir+0x2b/0x30 [xfs] iterate_dir+0x97/0x130 SyS_getdents+0x91/0x120 entry_SYSCALL_64_fastpath+0x12/0x76 -> #2 (&xfs_dir_ilock_class){++++.+}: lock_acquire+0xc7/0x270 down_read_nested+0x57/0xa0 xfs_ilock+0x167/0x350 [xfs] xfs_ilock_attr_map_shared+0x38/0x50 [xfs] xfs_attr_get+0xbd/0x190 [xfs] xfs_xattr_get+0x3d/0x70 [xfs] generic_getxattr+0x4f/0x70 inode_doinit_with_dentry+0x162/0x670 sb_finish_set_opts+0xd9/0x230 selinux_set_mnt_opts+0x35c/0x660 superblock_doinit+0x77/0xf0 delayed_superblock_init+0x10/0x20 iterate_supers+0xb3/0x110 selinux_complete_init+0x2f/0x40 security_load_policy+0x103/0x600 sel_write_load+0xc1/0x750 __vfs_write+0x37/0x100 vfs_write+0xa9/0x1a0 SyS_write+0x58/0xd0 entry_SYSCALL_64_fastpath+0x12/0x76 ... 
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Reported-by: Morten Stevens <mstevens@fedoraproject.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Paul Moore <paul@paul-moore.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 | ocfs2: fix shift left overflow | Joseph Qi
When using a large volume, for example 9T volume with 2T already used, frequent creation of small files with O_DIRECT when the IO is not cluster aligned may clear sectors in the wrong place. This will cause filesystem corruption. This is because p_cpos is a u32. When calculating the corresponding sector it should be converted to u64 first, otherwise it may overflow. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
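The overflow described above can be reproduced in plain userspace C. This is a minimal sketch, not the ocfs2 code: the function names and the shift value of 11 (1 MiB clusters over 512-byte sectors) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Buggy form: p_cpos is 32 bits wide, so the shift is evaluated in
 * 32 bits and the high bits are lost before the result is widened.
 * (Hypothetical helper, not the ocfs2 function.)
 */
uint64_t cluster_to_sector_buggy(uint32_t p_cpos, unsigned int shift)
{
    assert(shift < 32);
    return p_cpos << shift;
}

/* Fixed form: convert to 64 bits first, then shift. */
uint64_t cluster_to_sector_fixed(uint32_t p_cpos, unsigned int shift)
{
    assert(shift < 32);
    return (uint64_t)p_cpos << shift;
}
```

With the assumed shift of 11, cluster 2^21 (2 TiB into the volume) maps to sector 0 in the buggy form and to sector 2^32 in the fixed form, which is the "wrong place" the message refers to.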
2015-08-07 | fsnotify: fix oops in fsnotify_clear_marks_by_group_flags() | Jan Kara
fsnotify_clear_marks_by_group_flags() can race with fsnotify_destroy_marks() so that when fsnotify_destroy_mark_locked() drops mark_mutex, a mark from the list iterated by fsnotify_clear_marks_by_group_flags() can be freed and thus the next entry pointer we have cached may become stale and we dereference free memory. Fix the problem by first moving marks to free to a special private list and then always free the first entry in the special list. This method is safe even when entries from the list can disappear once we drop the lock. Signed-off-by: Jan Kara <jack@suse.com> Reported-by: Ashish Sangwan <a.sangwan@samsung.com> Reviewed-by: Ashish Sangwan <a.sangwan@samsung.com> Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
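The "private list, always free the first entry" pattern can be sketched in userspace C as follows. This is only an illustration of the idea under simplifying assumptions: struct mark, the list helpers and clear_marks_by_flags() are hypothetical stand-ins, there is no real locking, and the comments mark where the kernel would hold or drop mark_mutex.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a mark linked into a group's circular list. */
struct mark {
    struct mark *prev, *next;
    unsigned int flags;
};

void list_init(struct mark *head) { head->prev = head->next = head; }
int list_is_empty(const struct mark *head) { return head->next == head; }

void list_unlink(struct mark *m)
{
    m->prev->next = m->next;
    m->next->prev = m->prev;
    m->prev = m->next = m;
}

void list_append(struct mark *m, struct mark *head)
{
    m->prev = head->prev;
    m->next = head;
    head->prev->next = m;
    head->prev = m;
}

struct mark *mark_new(struct mark *head, unsigned int flags)
{
    struct mark *m = malloc(sizeof(*m));
    assert(m != NULL);
    m->flags = flags;
    list_append(m, head);
    return m;
}

/* Destroys every mark whose flags intersect `flags`; returns the count. */
int clear_marks_by_flags(struct mark *group_list, unsigned int flags)
{
    struct mark to_free, *m, *next;
    int destroyed = 0;

    list_init(&to_free);

    /* Step 1 (lock held throughout): move matching marks to a private list. */
    for (m = group_list->next; m != group_list; m = next) {
        next = m->next; /* safe: nothing is freed inside this loop */
        if (m->flags & flags) {
            list_unlink(m);
            list_append(m, &to_free);
        }
    }

    /*
     * Step 2: always take the current first entry. No "next" pointer is
     * cached across the point where the kernel would drop the lock.
     */
    while (!list_is_empty(&to_free)) {
        m = to_free.next;
        list_unlink(m);
        free(m);
        destroyed++;
    }
    return destroyed;
}
```

Because step 2 re-reads the list head on every iteration, an entry that disappears from the private list while the lock is dropped cannot leave a stale pointer behind.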
2015-08-07 | signalfd: fix information leak in signalfd_copyinfo | Amanieu d'Antras
This function may copy the si_addr_lsb field to user mode when it hasn't been initialized, which can leak kernel stack data to user mode. Just checking the value of si_code is insufficient because the same si_code value is shared between multiple signals. This is solved by checking the value of si_signo in addition to si_code. Signed-off-by: Amanieu d'Antras <amanieu@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 | ocfs2: fix BUG in ocfs2_downconvert_thread_do_work() | Joseph Qi
The "BUG_ON(list_empty(&osb->blocked_lock_list))" in ocfs2_downconvert_thread_do_work can be triggered in the following case: ocfs2dc first saves osb->blocked_lock_count to the local variable `processed', and then processes the dentry lockres. During the dentry put, it calls iput and then deletes the rw, inode and open lockres from the blocked list in ocfs2_mark_lockres_freeing. As a result, the variable `processed' no longer reflects the number of blocked lockres to be processed, which triggers the BUG. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 | fs, file table: reinit files_stat.max_files after deferred memory initialisation | Mel Gorman
Dave Hansen reported the following; My laptop has been behaving strangely with 4.2-rc2. Once I log in to my X session, I start getting all kinds of strange errors from applications and see this in my dmesg: VFS: file-max limit 8192 reached The problem is that the file-max is calculated before memory is fully initialised and miscalculates how much memory the kernel is using. This patch recalculates file-max after deferred memory initialisation. Note that using memory hotplug infrastructure would not have avoided this problem as the value is not recalculated after memory hot-add. 4.1: files_stat.max_files = 6582781 4.2-rc2: files_stat.max_files = 8192 4.2-rc2 patched: files_stat.max_files = 6562467 Small differences with the patch applied and 4.1 but not enough to matter. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Dave Hansen <dave.hansen@intel.com> Cc: Nicolai Stange <nicstange@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Alex Ng <alexng@microsoft.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-06 | btrfs: qgroup: Fix a regression in qgroup reserved space. | Qu Wenruo
During the change to the new extent-oriented btrfs qgroup implementation, the old __qgroup_excl_accounting() was no longer used for exclusive extents, so the reserved bytes were not freed. Because the reserved space is never freed, the limit function misbehaves: increasing the limit has no effect and still causes EDQUOT. The fix is simple: free the reserved bytes for a newly created exclusive extent, as was done before. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Yang Dongsheng <yangds.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-05 | f2fs: recover invalid/reserved block address for fsynced file | Chao Yu
When testing with generic/101 in xfstests, the following error message is output: --- tests/generic/101.out +++ results//generic/101.out.bad @@ -10,10 +10,14 @@ File foo content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * -0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 +0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb * 0372000 ... (Run 'diff -u tests/generic/101.out results/generic/101.out.bad' to see the entire diff) The test flow is as follows: 1. pwrite foo -S 0xaa 0 64K 2. pwrite foo -S 0xbb 64K 61K 3. sync 4. truncate foo 64K 5. truncate foo 125K 6. fsync foo 7. flakey drop writes 8. umount After this test, we expect the recovered file to have its first 64k filled with the value 0xaa and the next 61k filled with the value 0x00, because we fsynced it before dropping writes in dm. In f2fs, during recovery, we only recover a valid block address in the direct node page if it is marked as an fsynced dnode; a block address that means invalid/reserved (with value NULL_ADDR/NEW_ADDR) is not recovered. So the recovered file shows incorrect data 0xbb in the range [61k, 125k]. This patch recovers invalid/reserved block addresses during the recovery flow. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: use extent cache to optimize f2fs_reserve_block | Fan Li
In some cases, we only need the block address when we call f2fs_reserve_block, other fields of struct dnode_of_data aren't necessary. We can try extent cache first for such cases in order to speed up the process. Signed-off-by: Fan li <fanofcode.li@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | fs/char_dev.c: fix incorrect documentation for unregister_chrdev_region | Partha Pratim Mukherjee
The current documentation for unregister_chrdev_region says that it returns a range of device numbers, which is incorrect. Instead, it unregisters a range of device numbers. Fix the documentation to make this clear. Signed-off-by: Partha Pratim Mukherjee <ppm.floss@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-05 | xprtrdma: Fix large NFS SYMLINK calls | Chuck Lever
Repair how rpcrdma_marshal_req() chooses which RDMA message type to use for large non-WRITE operations so that it picks RDMA_NOMSG in the correct situations, and sets up the marshaling logic to SEND only the RPC/RDMA header. Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS server XDR decoder for NFSv2 SYMLINK does not handle having the pathname argument arrive in a separate buffer. The decoder could be fixed, but this is simpler and RDMA_NOMSG can be used in a variety of other situations. Ensure that the Linux client continues to use "RDMA_MSG + read list" when sending large NFSv3 SYMLINK requests, which is more efficient than using RDMA_NOMSG. Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG + read list" just like NFSv3 (see Section 5 of RFC 5667). Before, these did not work at all. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 | char: make misc_deregister a void function | Greg Kroah-Hartman
With well over 200+ users of this api, there are a mere 12 users that actually checked the return value of this function. And all of them really didn't do anything with that information as the system or module was shutting down no matter what. So stop pretending like it matters, and just return void from misc_deregister(). If something goes wrong in the call, you will get a WARNING splat in the syslog so you know how to fix up your driver. Other than that, there's nothing that can go wrong. Cc: Alasdair Kergon <agk@redhat.com> Cc: Neil Brown <neilb@suse.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Wim Van Sebroeck <wim@iguana.be> Cc: Christine Caulfield <ccaulfie@redhat.com> Cc: David Teigland <teigland@redhat.com> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <jlbec@evilplan.org> Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Acked-by: Alessandro Zummo <a.zummo@towertech.it> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-05 | f2fs: invalidate temporary meta page | Chao Yu
To avoid meeting garbage data in the next free node block at the end of the warm node chain when doing recovery, we try to zero out that invalid block. If the device does not support discard, we zero out the block by grabbing a temporary zeroed page in the meta inode and then issuing a write request with this page. But we forget to release that temporary page, so memory usage increases without any hit ratio benefit; it is better to free it to save memory. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: fix to release inode page correctly | Chao Yu
In the following call path, we pass a locked and referenced ipage pointer to get_new_data_page: - init_inode_metadata - make_empty_dir - get_new_data_page There are two exit paths in get_new_data_page when an error occurs: 1) grab_cache_page fails, ipage will not be released; 2) f2fs_reserve_block fails, ipage will be released in the callee. So error handling in get_new_data_page is not consistent. For f2fs_reserve_block, it's not very easy to change the rule of error handling, since it's already complicated. Here we decide to choose an easy way to fix this issue: if any error occurs in get_new_data_page, we ensure ipage is released in this function. The same issue is in f2fs_convert_inline_dir; fix that too. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: unify f2fs_bug_on when check blocks and segment | Liu Xue
Replace BUG_ON with f2fs_bug_on to handle failed block and segment validity checks. Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: freeze filesystem when fail to update meta page due to IO error | Chao Yu
In get_meta_page, we guarantee no failure for the returned page, but sometimes an IO error from the device causes a non-updated page to be returned. We then still use this page as an updated one, and an exception could happen when using such a page. So in this condition, we'd better freeze the fs by making it readonly and stop doing checkpoints. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: change the timing of f2fs_wait_on_page_writeback | Fan Li
Some backing devices need pages to be stable during writeback. Regardless of whether the page is completely overwritten or already uptodate, it needs to wait before the write. Signed-off-by: Fan li <fanofcode.li@samsung.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: handle error cases in commit_inmem_pages | Jaegeuk Kim
This patch adds error handling to commit_inmem_pages. If an error occurs, it stops writing the pages and returns the error right away. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: fix to build free nids from readaheaded nat pages | Chao Yu
When there are not enough free nids in the free nid cache, we try to readahead FREE_NID_PAGES:4 nat pages into the page cache of meta_inode, and then read the nat entries in those nat pages to add free nids to the free nid cache. But when traversing the readahead nat pages in a loop, our exit condition is not set right: one more nat page is scanned without readahead, resulting in worse read performance. This patch reads the correct number of nat pages to avoid bad performance. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
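An off-by-one exit condition of the kind described can be illustrated with two loops. This is a guess at the shape of the problem, not the f2fs code: both functions and the exact form of each exit test are assumptions, and each loop body only counts how many pages it would scan.

```c
#include <assert.h>

/*
 * Assumed shape of the faulty loop: the post-increment compares the old
 * value of i, so the loop body runs one extra time before exiting.
 */
int scanned_with_late_exit(int readahead_pages)
{
    int i = 0, scanned = 0;

    assert(readahead_pages > 0);
    for (;;) {
        scanned++; /* scan one nat page */
        if (i++ == readahead_pages)
            break;
    }
    return scanned;
}

/* Assumed shape of the corrected loop: stop after exactly N pages. */
int scanned_with_exact_exit(int readahead_pages)
{
    int i = 0, scanned = 0;

    assert(readahead_pages > 0);
    for (;;) {
        scanned++; /* scan one nat page */
        if (++i >= readahead_pages)
            break;
    }
    return scanned;
}
```

With four pages read ahead, the first loop scans five pages and the second scans four, so the fifth scan is the one that would hit a page that was never read ahead.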
2015-08-05 | f2fs: fix inline data/dentry stat number leak | Chao Yu
If we clear the inline data/dentry flag in handle_failed_inode, we will fail to decrease the stat count of inline data/dentry in f2fs_evict_inode because the inode no longer has the flag. So remove the wrong clearing. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: convert inline data before set atomic/volatile flag | Chao Yu
In f2fs_ioc_start_{atomic,volatile}_write, if we fail to convert inline data, we report the error to the user but still leave the atomic/volatile flag set in the inode, which affects further writes to this file. Fix it. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: fix to wait all atomic written pages writeback | Chao Yu
This patch fixes the incorrect range (0, LONG_MAX) used in ranged fsync. If we use LONG_MAX as the parameter indicating the end of the file we want to synchronize, then on 32-bit architectures data after the 4GB offset may not be persisted to storage after ->fsync returns. Here, we change LONG_MAX to LLONG_MAX to fix this issue. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
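The effect of the range end can be modeled with fixed-width integers so that the result does not depend on the build machine. This is a sketch under stated assumptions: range_covers() and both macros are hypothetical, and INT32_MAX stands in for LONG_MAX on a platform where long is 32 bits wide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins so the demo behaves the same on 32-bit and 64-bit builds. */
#define LONG_MAX_ON_32BIT ((int64_t)INT32_MAX)
#define LLONG_MAX_ANYWHERE INT64_MAX

/* Model of a ranged sync: is byte offset `pos` inside [start, end]? */
bool range_covers(int64_t start, int64_t end, int64_t pos)
{
    assert(start <= end);
    return pos >= start && pos <= end;
}
```

A page at a 5 GiB offset falls outside (0, LONG_MAX_ON_32BIT) but inside (0, LLONG_MAX_ANYWHERE), which is why the wider constant is needed to mean "to the end of the file".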
2015-08-05 | f2fs: skip writing in ->writepages when no dirty pages exist | Chao Yu
When flushing comes from the background and there is no dirty page in the mapping of the inode, we'd better skip searching the mapping for dirty pages to write back. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: optimize f2fs_write_cache_pages | Tiezhu Yang
The statement "goto continue_unlock" is exactly the same in each if branch, and each branch's condition is true when "step" and "is_cold_data(page)" are both 0 or both 1. That means the condition is true whenever the value of "step" equals "is_cold_data(page)", so "goto continue_unlock" needs to appear only once, and the code can be simplified to remove the duplication. Signed-off-by: Tiezhu Yang <kernelpatch@126.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
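The simplification can be checked exhaustively, since both inputs only take the values 0 and 1. The two-branch form below is reconstructed from the description in the message rather than copied from f2fs, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Reconstructed two-branch form: skip when both are 0 or both are 1. */
bool skip_page_two_branches(int step, int is_cold)
{
    assert(step == 0 || step == 1);
    assert(is_cold == 0 || is_cold == 1);
    if (step == 0 && is_cold == 0)
        return true;
    if (step == 1 && is_cold == 1)
        return true;
    return false;
}

/* Merged form: a single equality test covers both branches. */
bool skip_page_merged(int step, int is_cold)
{
    assert(step == 0 || step == 1);
    assert(is_cold == 0 || is_cold == 1);
    return step == is_cold;
}
```

The equivalence only holds because is_cold is restricted to 0 or 1; a predicate that could return other nonzero values would need to be normalized first.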
2015-08-05 | f2fs: fix double lock in handle_failed_inode | Chao Yu
In handle_failed_inode, there is a potential deadlock which can happen in below call path: - f2fs_create - f2fs_lock_op down_read(cp_rwsem) - f2fs_add_link - __f2fs_add_link - init_inode_metadata - f2fs_init_security failed - truncate_blocks failed - handle_failed_inode - f2fs_truncate - truncate_blocks(..,true) - write_checkpoint - block_operations - f2fs_lock_all down_write(cp_rwsem) - f2fs_lock_op down_read(cp_rwsem) So in this path, we pass parameter to f2fs_truncate to make sure cp_rwsem in truncate_blocks will not be locked again. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: reduce region of cp_rwsem covered in f2fs_do_collapse | Chao Yu
In f2fs_do_collapse, the region covered by cp_rwsem is large, since the lock is held until all blocks are left shifted. So if we collapse a small area at the beginning of a large file, a checkpoint that wants to grab the writer's lock of cp_rwsem is delayed for a long time. To avoid this, lock/unlock cp_rwsem around each shift operation instead. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: add new interfaces for extent tree | Fan Li
Add a lookup and an insertion interface for the extent tree. The new lookup returns the insert position and the prev/next extents closest to the offset we look up when it finds no match. The new insertion uses the above parameters to improve performance. There are three possible insertions after the lookup in f2fs_update_extent_tree. Two of them insert parts of the removed extent back into the tree; since no merge happens during this process, the new insertion skips the merge check in this scenario. The other insertion adds a new extent to the tree; the new insertion uses the prev/next extents and insert position to insert this extent directly, saving the time of searching down the tree. As long as the tree remains unchanged between lookup and insertion, this works fine. The new lookup will also be useful when adding multi-block extent support to the insertion interface. Signed-off-by: Fan li <fanofcode.li@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: callers take care of the page from bio error | Jaegeuk Kim
This patch makes the caller handle the page after its bio gets an error. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: use atomic_t to record hit ratio info of extent cache | Chao Yu
Variables for recording extent cache ratio info were updated without protection; this patch changes them to the atomic_t type for more accurate stats. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: stat inline xattr inode number | Chao Yu
This patch adds a stat for the number of inline xattr inodes, shown in debugfs. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | f2fs: use a page temporarily for encrypted gced page | Jaegeuk Kim
That encrypted page is used temporarily, so we don't need to mark it accessed. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 | Merge branch 'for-4.2' of git://linux-nfs.org/~bfields/linux | Linus Torvalds
Pull nfsd fixes from Bruce Fields. * 'for-4.2' of git://linux-nfs.org/~bfields/linux: nfsd: do nfs4_check_fh in nfs4_check_file instead of nfs4_check_olstateid nfsd: Fix a file leak on nfsd4_layout_setlease failure nfsd: Drop BUG_ON and ignore SECLABEL on absent filesystem
2015-08-04 | may_follow_link() should use nd->inode | Al Viro
Now that we can get there in RCU mode, we shouldn't play with nd->path.dentry->d_inode - it's not guaranteed to be stable. Use nd->inode instead. Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-08-04 | f2fs: expose f2fs_write_cache_pages | Chao Yu
If there are gced dirty pages and normal dirty pages in the mapping of one inode, we might write them back alternately with discontinuous block addresses, resulting in low performance. This patch introduces f2fs_write_cache_pages with code copied from write_cache_pages in mm/page-writeback.c. In this function, we refactor the flow into two steps: 1) writeback all cold type pages. 2) writeback all non-cold type pages. With this method, f2fs writes back dirty pages of the same temperature in batches, which gives the written-out blocks more contiguous addresses, so they can be merged as much as possible in the f2fs bio cache, and it also reduces the chance of submitting small IO from the block layer. Test environment: 8g nokia sd card (very old sd card, but it shows a better effect when testing with this patch; with a 32g kingston sd card, I didn't see much more improvement). Test step: 1. touch testfile; 2. truncate -s 512K testfile; 3. write all pages with odd index; 4. trigger gc by ioctl; 5. write all pages with even index; 6. time fsync testfile. before: real 0m0.402s user 0m0.000s sys 0m0.000s after: real 0m0.143s user 0m0.004s sys 0m0.004s Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 | f2fs: correct return value of ->setxattr | Chao Yu
This patch fixes to return correct error number of ->setxattr, which is reported by xfstest tests/generic/026 as below: generic/026 - output mismatch --- tests/generic/026.out +++ results/generic/026.out.bad @@ -4,6 +4,6 @@ 1 below acl max acl max 1 above acl max -chacl: cannot set access acl on "largeaclfile": Argument list too long +chacl: cannot set access acl on "largeaclfile": Numerical result out of range use 16 aces use 17 aces ... Ran: generic/026 Failures: generic/026 Failed 1 of 1 tests Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 | f2fs: cleanup write_orphan_inodes | Chao Yu
Previously, since 'commit 4531929e3922 ("f2fs: move grabing orphan pages out of protection region")' was committed, in write_orphan_inodes(), we will grab all meta page in a batch before we use them under spinlock, so that we can avoid large time delay of grabbing meta pages under spinlock. Now, 'commit d6c67a4fee86 ("f2fs: revmove spin_lock for write_orphan_inodes")' remove the spinlock in write_orphan_inodes, so there is no issue we describe above, we'd better recover to move the grab operation to original place for readability. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 | f2fs: warm up cold page after mmaped write | Chao Yu
With the cost-benefit method, background gc considers an old section with fewer valid blocks as a candidate victim; the old blocks in that section are treated as cold data and are later moved into a cold segment. But if the page being gced is touched by the user through a buffered or mmaped write, we should reset the page to non-cold, because it is more likely to be updated again. So add the clearing code for the missed 'mmap' case. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 | f2fs: add new ioctl F2FS_IOC_GARBAGE_COLLECT | Chao Yu
When background gc is off, the only way to trigger gc is a forced gc executed by operations that want to grab space on disk. The executing condition is limited: forced gc runs only when there is almost no free section left for LFS allocation. This is not reasonable for users who want to control gc triggering themselves. This patch introduces the F2FS_IOC_GARBAGE_COLLECT interface for triggering garbage collection through ioctl. It gives users one more option to trigger gc. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 | f2fs: maintain extent cache in separated file | Chao Yu
This patch moves extent cache related code from data.c into extent_cache.c. Since the extent cache is an independent feature and its code is not related to the rest of data.c, it is better to maintain it in a separate place. There is no functionality change, but several small coding style fixes are included: * rename __drop_largest_extent to f2fs_drop_largest_extent for exporting; * rename misspelled word 'untill' to 'until'; * remove unneeded 'return' in the end of f2fs_destroy_extent_tree(). Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 | f2fs: don't try to split extents shorter than F2FS_MIN_EXTENT_LEN | Fan Li
Since only parts of extents longer than F2FS_MIN_EXTENT_LEN will be kept in extent cache after split, extents already shorter than F2FS_MIN_EXTENT_LEN don't need to try split at all. Signed-off-by: Fan Li <fanofcode.li@samsung.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 | f2fs: fix to update page flag | Chao Yu
This patch updates the page flags (e.g. Uptodate/cold flag) in ->write_begin. Otherwise, the page will be non-uptodate when we try to write an entire page, and the cold data flag in the page will not be cleared when a gced page is being rewritten. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 | f2fs: shrink unreferenced extent_caches first | Jaegeuk Kim
If an extent_tree entry has a zero reference count, we can drop it from the cache in higher priority rather than currently referencing entries. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 | f2fs: enhance multithread performance | Chao Yu
In ->writepages, we use the writepages mutex to serialize all block address allocation and page submission pairs from different inodes. This lets the delayed dirty pages of one inode be written as contiguously as possible. But there is one problem: we did not submit the current cached bio inside the region protected by the writepages mutex, so there is a small chance that we submit another thread's bio as shown below, resulting in more bio splits. thread 1 thread 2 ->writepages lock(writepages) ->write_cache_pages unlock(writepages) lock(writepages) ->write_cache_pages ->f2fs_submit_merged_bio ->writepage unlock(writepages) fs_mark-6535 [002] .... 2242.270230: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5766152, size = 524288 fs_mark-6536 [000] .... 2242.270361: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5767176, size = 4096 fs_mark-6536 [000] .... 2242.270370: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, NODE, sector = 8138112, size = 4096 fs_mark-6535 [002] .... 2242.270776: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5767184, size = 516096 This may increase the time spent in the block layer and may cause larger IO latency. This patch moves the submitting operation into the region protected by the writepages mutex to avoid bio splits when concurrent writeback is intensive. My test environment: virtual machine, intel cpu i5 2500, 8GB size memory, 4GB size ramdisk time fs_mark -t 16 -L 1 -s 524288 -S 1 -d /mnt/f2fs/ before: real 0m4.244s user 0m0.088s sys 0m12.336s after: real 0m3.822s user 0m0.072s sys 0m10.760s Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 | f2fs: restrict multimedia filename | Chao Yu
When testing with fs_mark, some blocks were written out as cold data mixed with warm data, resulting in more bio splits. This is because fs_mark creates files with random filenames such as: 559551ee~~~~~~~~15Z29OCC05JCKQP60JQ42MKV 559551ee~~~~~~~~NZAZ6X8OA8LHIIP6XD0L58RM 559551ef~~~~~~~~B15YDSWAK789HPSDZKYTW6WM 559551f1~~~~~~~~2DAE5DPS79785BUNTFWBEMP3 559551f1~~~~~~~~1MYDY0BKSQCJPI32Q8C514RM 559551f1~~~~~~~~YQOTMAOMN5CVRFOUNI026MP4 559551f3~~~~~~~~1WF42LPRTQJNPPGR3EINKMPE 559551f3~~~~~~~~8Y2NRK7CEPPAA02LY936PJPG They are regarded as cold files because their filenames end with a multimedia file extension, but this is wrong: we match only the trailing characters of the filename, without requiring a '.' separator. In this patch, we fix the format of a multimedia filename to "filename + '.' + extension", and we mark a file as cold only if its filename matches that format. After this change, we are less likely to mark the wrong file as cold, and fs_mark's performance on f2fs improves slightly. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
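The difference between a bare suffix match and the "filename + '.' + extension" format can be sketched as two small predicates. Both functions are hypothetical illustrations written for this listing, not the f2fs helpers; strncasecmp() is the POSIX case-insensitive comparison from <strings.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>

/* Loose check: true whenever the name merely ends with `ext`. */
bool ends_with_suffix(const char *name, const char *ext)
{
    size_t nlen, elen;

    assert(name != NULL && ext != NULL);
    nlen = strlen(name);
    elen = strlen(ext);
    if (elen > nlen)
        return false;
    return strncasecmp(name + nlen - elen, ext, elen) == 0;
}

/* Strict check: requires at least one character, then '.', then `ext`. */
bool has_dot_extension(const char *name, const char *ext)
{
    size_t nlen, elen;

    assert(name != NULL && ext != NULL);
    nlen = strlen(name);
    elen = strlen(ext);
    if (nlen < elen + 2)
        return false;
    if (name[nlen - elen - 1] != '.')
        return false;
    return strncasecmp(name + nlen - elen, ext, elen) == 0;
}
```

One of the fs_mark names quoted above, 559551f1~~~~~~~~YQOTMAOMN5CVRFOUNI026MP4, passes the loose check for "mp4" but fails the strict one, while a name such as clip.MP4 passes both.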
2015-08-04 | f2fs: make the function check_dnode have a return type of bool and change it's name to is_alive | Nicholas Krause
This makes the function check_dnode have a return type of bool due to this particular function only ever returning either one or zero as its return value and changes the name of the function to is_alive in order to better explain this function's intended work of checking if a dnode is still in use by the filesystem. Signed-off-by: Nicholas Krause <xerofoify@gmail.com> [Jaegeuk Kim: change the return value check for the renamed function] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 | f2fs: check the largest extent at look-up time | Jaegeuk Kim
Because of the extent shrinker or other -ENOMEM scenarios, it cannot guarantee that the largest extent would be cached in the tree all the time. Instead of relying on extent_tree, we can simply check the cached one in extent tree accordingly. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 | f2fs: use extent_cache by default | Jaegeuk Kim
We don't need to handle the duplicate extent information. The integrated rule is: - update on-disk extent with largest one tracked by in-memory extent_cache - destroy extent_tree for the truncation case - drop per-inode extent_cache by shrinker Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 | f2fs: add noextent_cache mount option | Jaegeuk Kim
This patch adds noextent_cache mount option. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 | f2fs: shrink extent_cache entries | Jaegeuk Kim
This patch registers shrinking extent_caches. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>