path: root/fs
Age  Commit message  Author
2020-09-15  xfs: reuse _xfs_buf_read for re-reading the superblock  [Christoph Hellwig]
Instead of poking deeply into buffer cache internals when re-reading the superblock during log recovery just generalize _xfs_buf_read and use it there. Note that we don't have to explicitly set up the ops as they must be set from the initial read. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15  xfs: remove xfs_getsb  [Christoph Hellwig]
Merge xfs_getsb into its only caller, and clean that one up a little bit as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15  xfs: simplify xfs_trans_getsb  [Christoph Hellwig]
Remove the mp argument as this function is only called in transaction context, and open code xfs_getsb given that the function already accesses the buffer pointer in the mount point directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15  xfs: remove xlog_recover_iodone  [Christoph Hellwig]
The log recovery I/O completion handler does not substantially differ from the normal one, except that it:

 a) never retries failed writes
 b) can have log items that aren't on the AIL
 c) never has inode/dquot log items attached and thus doesn't need to handle them

Add conditionals for (a) and (b) to the ioend code, while (c) doesn't need special handling anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15  xfs: clear the read/write flags later in xfs_buf_ioend  [Christoph Hellwig]
Clear the flags at the end of xfs_buf_ioend so that they can be used during the completion. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15  xfs: use xfs_buf_item_relse in xfs_buf_item_done  [Christoph Hellwig]
Reuse xfs_buf_item_relse instead of duplicating it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15  xfs: simplify the xfs_buf_ioend_disposition calling convention  [Christoph Hellwig]
Now that all the actual error handling is in a single place, xfs_buf_ioend_disposition just needs to return true if it took ownership of the buffer, or false if not, instead of the tristate. Also move the error check back into the caller to optimize for the fast path, and give the function a better-fitting name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15  xfs: lift the XBF_IOEND_FAIL handling into xfs_buf_ioend_disposition  [Christoph Hellwig]
Keep all the error handling code together. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15  xfs: remove xfs_buf_ioerror_retry  [Christoph Hellwig]
Merge xfs_buf_ioerror_retry into its only caller to make the resubmission flow a little easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15  xfs: refactor xfs_buf_ioerror_fail_without_retry  [Christoph Hellwig]
xfs_buf_ioerror_fail_without_retry is a somewhat weird function in that it has two trivial checks that decide the return value, while the rest implements a ratelimited warning. Just lift the two checks into the caller, and give the remainder a suitable name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15  xfs: fold xfs_buf_ioend_finish into xfs_ioend  [Christoph Hellwig]
No need to keep a separate helper for this logic. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15  xfs: move the buffer retry logic to xfs_buf.c  [Christoph Hellwig]
Move the buffer retry state machine logic to xfs_buf.c and call it once from xfs_ioend instead of duplicating it three times for the three kinds of buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15  xfs: refactor xfs_buf_ioend  [Christoph Hellwig]
Move the log recovery I/O completion handling entirely into the log recovery code, and re-arrange the normal I/O completion handler flow to prepare to lifting more logic into common code in the next commits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15  xfs: mark xfs_buf_ioend static  [Christoph Hellwig]
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15  xfs: refactor the buf ioend disposition code  [Christoph Hellwig]
Handle the no-error case in xfs_buf_iodone_error as well, and, to clarify the code, rename the function, use the actual enum type as the return value, and then switch on it in the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-16  mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race  [Nicholas Piggin]
Reading and modifying current->mm and current->active_mm and switching mm should be done with irqs off, to prevent races seeing an intermediate state. This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate"). At exec-time when the new mm is activated, the old one should usually be single-threaded and no longer used, unless something else is holding an mm_users reference (which may be possible). Absent other mm_users, there is also a race with preemption and lazy tlb switching. Consider the kernel_execve case where the current thread is using a lazy tlb active mm:

  call_usermodehelper()
    kernel_execve()
      old_mm = current->mm;
      active_mm = current->active_mm;
      *** preempt *** -------------------->  schedule()
                                               prev->active_mm = NULL;
                                               mmdrop(prev active_mm);
                                             ...
                      <--------------------  schedule()
      current->mm = mm;
      current->active_mm = mm;
      if (!old_mm)
          mmdrop(active_mm);

If we switch back to the kernel thread from a different mm, there is a double free of the old active_mm, and a missing free of the new one. Closing this race only requires interrupts to be disabled while ->mm and ->active_mm are being switched, but the TLB problem requires also holding interrupts off over activate_mm. Unfortunately not all archs can do that yet, e.g., arm defers the switch if irqs are disabled and expects finish_arch_post_lock_switch() to be called to complete the flush; um takes a blocking lock in activate_mm(). So as a first step, disable interrupts across the mm/active_mm updates to close the lazy tlb preempt race, and provide an arch option to extend that to activate_mm which allows architectures doing IPI based TLB shootdowns to close the second race. This is a bit ugly, but in the interest of fixing the bug and backporting before all architectures are converted this is a compromise. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
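A minimal sketch of the shape of the fix at exec time (illustrative only; the exact call site in exec_mmap() and the name of the arch opt-in option are assumptions, not quoted from the patch):

    /* Sketch: switch ->mm/->active_mm with irqs off to close the lazy tlb
     * preempt race; archs that opt in also keep irqs off across activate_mm()
     * so IPI-based TLB shootdowns cannot race with the switch. */
    local_irq_disable();
    active_mm = tsk->active_mm;
    tsk->active_mm = mm;
    tsk->mm = mm;
    if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))  /* option name assumed */
        local_irq_enable();
    activate_mm(active_mm, mm);
    if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
        local_irq_enable();
    mmdrop(active_mm);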
2020-09-15  zonefs: open/close zone on file open/close  [Johannes Thumshirn]
NVMe Zoned Namespace introduced the concept of active zones, which are zones in the implicit open, explicit open or closed condition. Drives may have a limit on the number of zones that can be simultaneously active. This potential limitation translates into a risk for applications to see write IO errors due to this limit if the zone of a file being written to is not already active when a write request is issued. To avoid these potential errors, the zone of a file can explicitly be made active using an open zone command when the file is opened for the first time. If the zone open command succeeds, the application is then guaranteed that write requests can be processed. This indirect management of active zones relies on the maximum number of open zones of a drive, which is always lower than or equal to the maximum number of active zones. On the first open of a sequential zone file, send a REQ_OP_ZONE_OPEN command to the block device. Conversely, on the last release of a zone file, send a REQ_OP_ZONE_CLOSE command to the device if the zone is not full or empty. As truncating a zone file to 0 or max can deactivate a zone as well, we need to serialize against truncates and also be careful not to close a zone as the file may still be open for writing, e.g. the user called ftruncate(). If the zone file is not open and a process does a truncate(), then no close operation is needed. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
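A hedged sketch of the first-open accounting described above (the function name, the write-open counter and the zone-management helper are assumptions; the helper mirrors the one introduced in the "introduce helper for zone management" entry further down this log):

    static int zonefs_seq_file_open(struct inode *inode, struct file *file)
    {
        struct zonefs_inode_info *zi = ZONEFS_I(inode);
        int ret = 0;

        mutex_lock(&zi->i_truncate_mutex);
        if (!zi->i_wr_refcnt) {                         /* counter name assumed */
            /* First opener: explicitly make the zone active. */
            ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_OPEN);
            if (ret)
                goto unlock;
        }
        zi->i_wr_refcnt++;
    unlock:
        mutex_unlock(&zi->i_truncate_mutex);
        return ret;
    }

On the last release the counter is decremented the same way and, when it hits zero and the zone is neither empty nor full, a REQ_OP_ZONE_CLOSE is issued.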
2020-09-15  zonefs: provide no-lock zonefs_io_error variant  [Johannes Thumshirn]
Subsequent patches need to call zonefs_io_error() with the i_truncate_mutex already held, so factor out the body of zonefs_io_error() into __zonefs_io_error(), which can be called with the i_truncate_mutex held. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-09-15  zonefs: introduce helper for zone management  [Johannes Thumshirn]
Introduce a helper function for sending zone management commands to the block device. As zone management commands can change a zone write pointer position reflected in the size of the zone file, this function expects the truncate mutex to be held. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
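A hedged sketch of what such a helper can look like (the zonefs_inode_info field names are assumptions; blkdev_zone_mgmt() is the block-layer API used to issue the command):

    /* Must be called with zi->i_truncate_mutex held, since the command can
     * move the zone write pointer and therefore the zone file size. */
    static int zonefs_zone_mgmt(struct inode *inode, enum req_opf op)
    {
        struct zonefs_inode_info *zi = ZONEFS_I(inode);

        lockdep_assert_held(&zi->i_truncate_mutex);

        return blkdev_zone_mgmt(inode->i_sb->s_bdev, op,
                                zi->i_zsector,                    /* zone start sector, assumed */
                                zi->i_zone_size >> SECTOR_SHIFT,  /* zone length, assumed */
                                GFP_NOFS);
    }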
2020-09-14  Merge tag 'for-5.9-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux  [Linus Torvalds]
Pull btrfs fix from David Sterba: "One of the recent lockdep fixes introduced a bug that breaks the search ioctl, which is used by some applications (bees, compsize). The patch made it to stable trees so we need this fixup to make it work again"

* tag 'for-5.9-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix wrong address when faulting in pages in the search ioctl
2020-09-14  f2fs: clean up kvfree  [Chao Yu]
After commit 0b6d4ca04a86 ("f2fs: don't return vmalloc() memory from f2fs_kmalloc()"), f2fs_k{m,z}alloc() will not return vmalloc()'ed memory, so clean up to use kfree() instead of kvfree() to free the memory they allocate. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
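A tiny illustration of the resulting pattern (the allocation site is hypothetical):

    void *buf;

    buf = f2fs_kmalloc(sbi, size, GFP_KERNEL);  /* never vmalloc()-backed anymore */
    if (!buf)
        return -ENOMEM;
    /* ... use buf ... */
    kfree(buf);                                 /* kvfree() no longer required */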
2020-09-14  io_uring: don't run task work on an exiting task  [Jens Axboe]
This isn't safe, and isn't needed either. We are guaranteed that any work we queue is on a live task (and will be run), or it goes to our backup io-wq threads if the task is exiting. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14  io_uring: drop 'ctx' ref on task work cancelation  [Jens Axboe]
If task_work ends up being marked for cancelation, we go through a cancelation helper instead of the queue path. In converting task_work to always hold a ctx reference, this path was missed. Make sure that io_req_task_cancel() puts the reference that is being held against the ctx. Fixes: 6d816e088c35 ("io_uring: hold 'ctx' reference around task_work queue + execute") Signed-off-by: Jens Axboe <axboe@kernel.dk>
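A hedged sketch of what the cancelation path ends up doing (helper names follow the io_uring code of that era but are not quoted from the patch):

    static void io_req_task_cancel(struct callback_head *cb)
    {
        struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
        struct io_ring_ctx *ctx = req->ctx;

        __io_req_task_cancel(req, -ECANCELED);
        percpu_ref_put(&ctx->refs);   /* drop the ref taken when the task_work was queued */
    }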
2020-09-14  btrfs: fix wrong address when faulting in pages in the search ioctl  [Filipe Manana]
When faulting in the pages for the user supplied buffer for the search ioctl, we are passing only the base address of the buffer to the function fault_in_pages_writeable(). This means that after the first iteration of the while loop that searches for leaves, when we have a non-zero offset, stored in 'sk_offset', we try to fault in a wrong page range. So fix this by adding the offset in 'sk_offset' to the base address of the user supplied buffer when calling fault_in_pages_writeable(). Several users have reported that the applications compsize and bees have started to operate incorrectly since commit a48b73eca4ceb9 ("btrfs: fix potential deadlock in the search ioctl") was added to stable trees, and these applications make heavy use of the search ioctls. This fixes their issues. Link: https://lore.kernel.org/linux-btrfs/632b888d-a3c3-b085-cdf5-f9bb61017d92@lechevalier.se/ Link: https://github.com/kilobyte/compsize/issues/34 Fixes: a48b73eca4ceb9 ("btrfs: fix potential deadlock in the search ioctl") CC: stable@vger.kernel.org # 4.4+ Tested-by: A L <mail@lechevalier.se> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
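In code, the shape of the fix is roughly the following (a sketch, not the verbatim patch; variable names are assumptions):

    while (1) {
        /* Fault in the part of the user buffer that will be written next,
         * not just its start: sk_offset advances as results are copied out. */
        ret = fault_in_pages_writeable(ubuf + sk_offset, *buf_size - sk_offset);
        if (ret)
            break;
        /* ... search one leaf, copy results to the user buffer, advance sk_offset ... */
    }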
2020-09-14  ext2: Fix some kernel-doc warnings in balloc.c  [Wang Hai]
Fixes the following W=1 kernel build warning(s):

  fs/ext2/balloc.c:203: warning: Excess function parameter 'rb_root' description in '__rsv_window_dump'
  fs/ext2/balloc.c:294: warning: Excess function parameter 'rb_root' description in 'search_reserve_window'
  fs/ext2/balloc.c:878: warning: Excess function parameter 'rsv' description in 'alloc_new_reservation'

Link: https://lore.kernel.org/r/20200911114036.60616-1-wanghai38@huawei.com Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>
2020-09-13  io_uring: grab any needed state during defer prep  [Jens Axboe]
Always grab work environment for deferred links. The assumption that we will be running it always from the task in question is false, as exiting tasks may mean that we're deferring this one to a thread helper. And at that point it's too late to grab the work environment. Fixes: debb85f496c9 ("io_uring: factor out grab_env() from defer_prep()") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-13  Merge tag 'driver-core-5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core  [Linus Torvalds]
Pull driver core fixes from Greg KH: "Here are some small driver core and debugfs fixes for 5.9-rc5

Included in here are:
 - firmware loader memory leak fix
 - firmware loader testing fixes for non-EFI systems
 - device link locking fixes found by lockdep
 - kobject_del() bugfix that has been affecting some callers
 - debugfs minor fix

All of these have been in linux-next for a while with no reported issues"

* tag 'driver-core-5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  test_firmware: Test platform fw loading on non-EFI systems
  PM: <linux/device.h>: fix @em_pd kernel-doc warning
  kobject: Drop unneeded conditional in __kobject_del()
  driver core: Fix device_pm_lock() locking for device links
  MAINTAINERS: Add the security document to SECURITY CONTACT
  driver code: print symbolic error code
  debugfs: Fix module state check condition
  kobject: Restore old behaviour of kobject_del(NULL)
  firmware_loader: fix memory leak for paged buffer
2020-09-12  Merge tag 'for-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux  [Linus Torvalds]
Pull btrfs fixes from David Sterba: "A few more fixes:

 - regression fix for a crash after failed snapshot creation
 - one more lockdep fix: use nofs allocation when allocating missing device
 - fix reloc tree leak on degraded mount
 - make some extent buffer alignment checks less strict to mount filesystems created by btrfs-convert"

* tag 'for-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix NULL pointer dereference after failure to create snapshot
  btrfs: free data reloc tree on failed mount
  btrfs: require only sector size alignment for parent eb bytenr
  btrfs: fix lockdep splat in add_missing_dev
2020-09-12  Merge tag '5.9-rc4-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6  [Linus Torvalds]
Pull cifs fix from Steve French: "A fix for lookup on DFS link when cifsacl or modefromsid is used"

* tag '5.9-rc4-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix DFS mount with cifsacl/modefromsid
2020-09-11  f2fs: change virtual mapping way for compression pages  [Daeho Jeong]
While profiling f2fs compression, I've found that vmap() calls have unexpected hikes in execution time in our test environment, and those are bottlenecks of the f2fs decompression path. Changing these to vm_map_ram(), we can enhance f2fs decompression speed pretty much.

[Verification]
Android Pixel 3 (ARM64, 6GB RAM, 128GB UFS)
Turned on only 0-3 little cores (at 1.785GHz)

  dd if=/dev/zero of=dummy bs=1m count=1000
  echo 3 > /proc/sys/vm/drop_caches
  dd if=dummy of=/dev/zero bs=512k

- w/o compression -
  1048576000 bytes (0.9 G) copied, 2.082554 s, 480 M/s
  1048576000 bytes (0.9 G) copied, 2.081634 s, 480 M/s
  1048576000 bytes (0.9 G) copied, 2.090861 s, 478 M/s

- before patch -
  1048576000 bytes (0.9 G) copied, 7.407527 s, 135 M/s
  1048576000 bytes (0.9 G) copied, 7.283734 s, 137 M/s
  1048576000 bytes (0.9 G) copied, 7.291508 s, 137 M/s

- after patch -
  1048576000 bytes (0.9 G) copied, 1.998959 s, 500 M/s
  1048576000 bytes (0.9 G) copied, 1.987554 s, 503 M/s
  1048576000 bytes (0.9 G) copied, 1.986380 s, 503 M/s

Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
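The change boils down to swapping the mapping primitive; a simplified sketch (the surrounding f2fs code and variable names are elided/assumed):

    /* Before: vmap() showed latency spikes in profiling. */
    dst = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);

    /* After: vm_map_ram() avoids that overhead; -1 means no NUMA node preference. */
    dst = vm_map_ram(pages, nr_pages, -1);
    /* ... decompress the cluster into dst ... */
    vm_unmap_ram(dst, nr_pages);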
2020-09-11  f2fs: change return value of f2fs_disable_compressed_file to bool  [Daeho Jeong]
The returned integer is not required anywhere, so change the return value to a bool type. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11  f2fs: change i_compr_blocks of inode to atomic value  [Daeho Jeong]
writepages() can be concurrently invoked for the same file by different threads, such as a thread fsyncing the file and a kworker kernel thread. So changing i_compr_blocks without protection is racy, and we need to protect it by making it an atomic value. Plus, we don't need a 64-bit value for i_compr_blocks, so use atomic_t rather than atomic64_t. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
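Illustrative shape of the change (field access via F2FS_I() as in f2fs; the call sites are hypothetical):

    /* struct f2fs_inode_info: was a plain 64-bit counter, now atomic_t */
    atomic_t i_compr_blocks;

    /* concurrent writepages() paths can now update it safely */
    atomic_add(compr_blocks, &F2FS_I(inode)->i_compr_blocks);
    /* ... */
    blocks = atomic_read(&F2FS_I(inode)->i_compr_blocks);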
2020-09-11  f2fs: ignore compress mount option on image w/o compression feature  [Chao Yu]
Keep consistent with the behavior when passing the compress mount option to a kernel without the compression feature, so that mount does not fail in that case. Reported-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11  f2fs: allocate proper size memory for zstd decompress  [Chao Yu]
As 5kft <5kft@5kft.org> reported:

  kworker/u9:3: page allocation failure: order:9, mode:0x40c40(GFP_NOFS|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
  CPU: 3 PID: 8168 Comm: kworker/u9:3 Tainted: G C 5.8.3-sunxi #trunk
  Hardware name: Allwinner sun8i Family
  Workqueue: f2fs_post_read_wq f2fs_post_read_work
  [<c010d6d5>] (unwind_backtrace) from [<c0109a55>] (show_stack+0x11/0x14)
  [<c0109a55>] (show_stack) from [<c056d489>] (dump_stack+0x75/0x84)
  [<c056d489>] (dump_stack) from [<c0243b53>] (warn_alloc+0xa3/0x104)
  [<c0243b53>] (warn_alloc) from [<c024473b>] (__alloc_pages_nodemask+0xb87/0xc40)
  [<c024473b>] (__alloc_pages_nodemask) from [<c02267c5>] (kmalloc_order+0x19/0x38)
  [<c02267c5>] (kmalloc_order) from [<c02267fd>] (kmalloc_order_trace+0x19/0x90)
  [<c02267fd>] (kmalloc_order_trace) from [<c047c665>] (zstd_init_decompress_ctx+0x21/0x88)
  [<c047c665>] (zstd_init_decompress_ctx) from [<c047e9cf>] (f2fs_decompress_pages+0x97/0x228)
  [<c047e9cf>] (f2fs_decompress_pages) from [<c045d0ab>] (__read_end_io+0xfb/0x130)
  [<c045d0ab>] (__read_end_io) from [<c045d141>] (f2fs_post_read_work+0x61/0x84)
  [<c045d141>] (f2fs_post_read_work) from [<c0130b2f>] (process_one_work+0x15f/0x3b0)
  [<c0130b2f>] (process_one_work) from [<c0130e7b>] (worker_thread+0xfb/0x3e0)
  [<c0130e7b>] (worker_thread) from [<c0135c3b>] (kthread+0xeb/0x10c)
  [<c0135c3b>] (kthread) from [<c0100159>]

zstd may allocate large amounts of memory for {,de}compression, which may cause file copy failures on low-end devices that have very little memory. For decompression, let's just allocate memory of a proper size based on the current file's cluster size instead of the max cluster size.

Reported-by: 5kft <5kft@5kft.org> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
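A hedged sketch of the sizing logic (the cluster-size field and the f2fs allocation helper are assumptions; ZSTD_DStreamWorkspaceBound() is the kernel zstd API for bounding the workspace needed for a given window size):

    /* Size the decompression workspace for this file's cluster size rather
     * than the filesystem-wide maximum, avoiding huge order-9 allocations. */
    unsigned int window_size = PAGE_SIZE << dic->log_cluster_size;  /* field assumed */
    unsigned int workspace_size = ZSTD_DStreamWorkspaceBound(window_size);
    void *workspace;
    ZSTD_DStream *stream;

    workspace = f2fs_kvmalloc(F2FS_I_SB(dic->inode), workspace_size, GFP_NOFS);
    if (!workspace)
        return -ENOMEM;

    stream = ZSTD_initDStream(window_size, workspace, workspace_size);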
2020-09-11  f2fs: change compr_blocks of superblock info to 64bit  [Daeho Jeong]
Currently, compr_blocks in the superblock info is not a 64-bit value. We are accumulating each inode's i_compr_blocks count into this value, and those are 64-bit values, so this needs to be changed to 64-bit as well. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11  f2fs: add block address limit check to compressed file  [Daeho Jeong]
We need to add a block address range check for the compressed file case and avoid calling get_data_block_bmap() for compressed files. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11  f2fs: check position in move range ioctl  [Dan Robertson]
When the move range ioctl is used, check the input and output positions and ensure that they are non-negative values. Without this check, f2fs_get_dnode_of_data may hit a memory bug. Signed-off-by: Dan Robertson <dan@dlrobertson.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
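The added guard amounts to a range check before any node lookup; an illustrative sketch:

    /* in f2fs_move_file_range(), before any f2fs_get_dnode_of_data() call */
    if (pos_in < 0 || pos_out < 0)
        return -EINVAL;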
2020-09-11  f2fs: correct statistic of APP_DIRECT_IO/APP_DIRECT_READ_IO  [Jack Qiu]
We miss updating APP_DIRECT_IO/APP_DIRECT_READ_IO when receiving async DIO. For example:

  fio -filename=/data/test.0 -bs=1m -ioengine=libaio -direct=1 -name=fill -size=10m -numjobs=1 -iodepth=32 -rw=write

Signed-off-by: Jack Qiu <jack.qiu@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11  f2fs: Simplify SEEK_DATA implementation  [Matthew Wilcox (Oracle)]
Instead of finding the first dirty page and then seeing if it matches the index of a block that is NEW_ADDR, delay the lookup of the dirty bit until we've actually found a block that's NEW_ADDR. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11  f2fs: support age threshold based garbage collection  [Chao Yu]
There are several issues in the current background GC algorithm:
 - valid blocks is one of the key factors in the cost overhead calculation, so if a segment has few valid blocks, the CB algorithm will still choose it as the victim even if its age is young or it is located in a hot segment; that's not appropriate.
 - GCed data/node blocks go to existing logs, no matter whether the data's update frequency is the same or not, so hot and cold data may be mixed again.
 - the GC allocator mainly uses LFS-type segments, consuming free segments more quickly.

This patch introduces a new algorithm named age-threshold-based garbage collection to solve the above issues. There are mainly three steps:

1. Select a source victim:
 - set an age threshold and select candidates based on the threshold: e.g. 0 means youngest, 100 means oldest; if we set the age threshold to 80, then select dirty segments whose age is in the range [80, 100] as candidates;
 - set a candidate_ratio threshold and select candidates based on the ratio, so that we can shrink the candidates to the oldest segments;
 - select the segment with the fewest valid blocks in order to migrate blocks with minimum cost.

2. Select a target victim:
 - select candidates based on the age threshold;
 - set a candidate_radius threshold and search for candidates whose age is around the source victim's; the search radius should be less than the radius threshold;
 - select the segment with the most valid blocks in order to avoid migrating the current target segment.

3. Merge valid blocks from the source victim into the target victim with the SSR allocator.

Test steps:
 - create 160 dirty segments:
   * half of them have 128 valid blocks per segment
   * the rest have 384 valid blocks per segment
 - run background GC

Benefit: GC count and block movement count both decrease obviously:

- Before:
  - Valid: 86
  - Dirty: 1
  - Prefree: 11
  - Free: 6001 (6001)
  GC calls: 162 (BG: 220)
    - data segments : 160 (160)
    - node segments : 2 (2)
  Try to move 41454 blocks (BG: 41454)
    - data blocks : 40960 (40960)
    - node blocks : 494 (494)
  IPU: 0 blocks
  SSR: 0 blocks in 0 segments
  LFS: 41364 blocks in 81 segments

- After:
  - Valid: 87
  - Dirty: 0
  - Prefree: 4
  - Free: 6008 (6008)
  GC calls: 75 (BG: 76)
    - data segments : 74 (74)
    - node segments : 1 (1)
  Try to move 12813 blocks (BG: 12813)
    - data blocks : 12544 (12544)
    - node blocks : 269 (269)
  IPU: 0 blocks
  SSR: 12032 blocks in 77 segments
  LFS: 855 blocks in 2 segments

Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
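A compressed pseudocode view of step 1, source-victim selection (all identifiers here are illustrative, not the actual f2fs symbols):

    unsigned int victim = NULL_SEGNO, min_vblocks = UINT_MAX;

    for_each_dirty_segment(sbi, segno) {            /* iterator assumed */
        if (segment_age(sbi, segno) < age_threshold)
            continue;                               /* too young */
        if (!in_oldest_candidate_ratio(sbi, segno, candidate_ratio))
            continue;                               /* keep only the oldest part */
        if (get_valid_blocks(sbi, segno, false) < min_vblocks) {
            min_vblocks = get_valid_blocks(sbi, segno, false);
            victim = segno;                         /* cheapest to migrate */
        }
    }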
2020-09-10  epoll: EPOLL_CTL_ADD: close the race in decision to take fast path  [Al Viro]
Checking for the lack of epitems referring to the epoll we want to insert into is not enough; we might have an insertion of that epoll into another one that has already collected the set of files to recheck for excessive reverse paths, but hasn't gotten to creating/inserting the epitem for it. However, any such insertion in progress can be detected - it will update the generation count in our epoll when it's done looking through it for files to check. That gets done under ->mtx of our epoll and that allows us to detect that safely. We are *not* holding epmutex here, so the generation count is not stable. However, since both the update of ep->gen by loop check and (later) insertion into ->f_ep_link are done with ep->mtx held, we are fine - the sequence is

  grab epmutex
  bump loop_check_gen
  ...
  grab tep->mtx                 // 1
  tep->gen = loop_check_gen
  ...
  drop tep->mtx                 // 2
  ...
  grab tep->mtx                 // 3
  ...
  insert into ->f_ep_link
  ...
  drop tep->mtx                 // 4
  bump loop_check_gen
  drop epmutex

and if the fastpath check in another thread happens for that eventpoll, it can come
 * before (1) - in that case fastpath is just fine
 * after (4) - we'll see non-empty ->f_ep_link, slow path taken
 * between (2) and (3) - loop_check_gen is stable, with ->mtx providing barriers and we end up taking slow path.

Note that ->f_ep_link emptiness check is slightly racy - we are protected against insertions into that list, but removals can happen right under us. Not a problem - in the worst case we'll end up taking a slow path for no good reason. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
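A simplified sketch of the resulting fastpath decision (variable and field names follow the description above but are illustrative, not the verbatim patch):

    /* Inside the EPOLL_CTL_ADD handler, with ep->mtx held: */
    if (!list_empty(&epfile->f_ep_links) ||     /* someone already has an epitem here */
        ep->gen == loop_check_gen)              /* a loop-check pass has walked us */
        full_check = 1;                         /* take the slow path */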
2020-09-10  f2fs: Use generic casefolding support  [Daniel Rosenberg]
This switches f2fs over to the generic support provided in the previous patch. Since casefolded dentries behave the same in ext4 and f2fs, we decrease the maintenance burden by unifying them, and any optimizations will immediately apply to both. Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10  fs: Add standard casefolding support  [Daniel Rosenberg]
This adds general supporting functions for filesystems that use utf8 casefolding. It provides standard dentry_operations and adds the necessary structures in struct super_block to allow this standardization. The new dentry operations are functionally equivalent to the existing operations in ext4 and f2fs, apart from the use of utf8_casefold_hash to avoid an allocation. By providing a common implementation, all users can benefit from any optimizations without needing to port over improvements. Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10  unicode: Add utf8_casefold_hash  [Daniel Rosenberg]
This adds a case-insensitive hash function to allow taking the hash without needing to allocate a casefolded copy of the string. The existing d_hash implementations for casefolding allocate memory within rcu-walk; by avoiding that, we can be more efficient and avoid worrying about a failed allocation. Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
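A hedged sketch of how a case-insensitive d_hash can use it (the utf8_casefold_hash() signature and the s_encoding field are assumptions consistent with the surrounding patches):

    static int ci_d_hash(const struct dentry *dentry, struct qstr *str)
    {
        const struct unicode_map *um = dentry->d_sb->s_encoding;

        /* Hash the casefolded form directly; no temporary buffer is needed,
         * so this stays cheap and allocation-free under rcu-walk. */
        if (utf8_casefold_hash(um, dentry, str) < 0)
            return -EINVAL;
        return 0;
    }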
2020-09-10  f2fs: compress: use more readable atomic_t type for {cic,dic}.ref  [Chao Yu]
A refcount_t type variable should never be less than one, so it's a little bit hard to understand when we use it to indicate a pending compressed page count. Let's change it to atomic_t for better readability. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10  f2fs: fix compile warning  [Chao Yu]
This patch fixes the below compile warning reported by LKP (kernel test robot):

  cppcheck warnings: (new ones prefixed by >>)
  >> fs/f2fs/file.c:761:9: warning: Identical condition 'err', second condition is always false [identicalConditionAfterEarlyExit]
     return err;
     ^
     fs/f2fs/file.c:753:6: note: first condition
     if (err)
     ^
     fs/f2fs/file.c:761:9: note: second condition
     return err;

Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10  f2fs: support 64-bits key in f2fs rb-tree node entry  [Chao Yu]
Then we can add a specified entry into the rb-tree with a 64-bit segment time as the key. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10  f2fs: inherit mtime of original block during GC  [Chao Yu]
Don't let f2fs inner GC ruin the original aging degree of a segment. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-10  f2fs: record average update time of segment  [Chao Yu]
Previously, once we updated one block in a segment, we updated the segment's mtime to the current time, making an aged segment become the freshest, with the result that GC with the cost-benefit algorithm misses such segments. So this patch changes to record mtime as the average block update time instead of the last update time.

It's not needed to reset mtime for a prefree segment, as se->valid_blocks is zero, so the old se->mtime won't take any weight in the calculation below:

  se->mtime = div_u64(se->mtime * se->valid_blocks + mtime,
                      se->valid_blocks + 1);

Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
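A worked example of the averaging (numbers are illustrative):

    /* A segment holding 3 valid blocks with an average mtime of 100 gets a
     * new block written at time 200:
     *     se->mtime = (100 * 3 + 200) / (3 + 1) = 125
     * so the segment ages gradually instead of snapping back to "freshest". */
    se->mtime = div_u64(se->mtime * se->valid_blocks + mtime,
                        se->valid_blocks + 1);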
2020-09-10  f2fs: introduce inmem curseg  [Chao Yu]
The previous implementation of aligned pinfile allocation will:
 - allocate a new segment on the cold data log no matter whether the last used segment is partially used or not, which makes IOs more random;
 - force concurrent cold data/GCed IO going into the warm data area, which can have a bad effect on hot/cold data separation.

In this patch, we introduce a new type of log named 'inmem curseg'. The differences from a normal curseg are:
 - it reuses existing segment types (CURSEG_XXX_NODE/DATA);
 - it only exists in memory; its segno, blkofs and summary will not be persisted into the checkpoint area.

With this new feature, we can enhance the scalability of logs, and special allocators can be created for specific purposes:
 - a pure LFS allocator for aligned pinfile allocation or file defragmentation
 - a pure SSR allocator for a later feature

So let's update aligned pinfile allocation to use this new inmem curseg framework. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>