|
Instead of poking deeply into buffer cache internals when re-reading the
superblock during log recovery just generalize _xfs_buf_read and use it
there. Note that we don't have to explicitly set up the ops as they
must be set from the initial read.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Merge xfs_getsb into its only caller, and clean that one up a little bit
as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Remove the mp argument as this function is only called in transaction
context, and open code xfs_getsb given that the function already accesses
the buffer pointer in the mount point directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
The log recovery I/O completion handler does not substantially differ from
the normal one except for the fact that it:
a) never retries failed writes
b) can have log items that aren't on the AIL
c) never has inode/dquot log items attached and thus doesn't need to
handle them
Add conditionals for (a) and (b) to the ioend code; (c) doesn't need
special handling anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Clear the flags at the end of xfs_buf_ioend so that they can be used
during the completion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Reuse xfs_buf_item_relse instead of duplicating it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Now that all the actual error handling is in a single place,
xfs_buf_ioend_disposition just needs to return true if it took ownership of
the buffer, or false if it did not, instead of the tristate. Also move the
error check back into the caller to optimize for the fast path, and give
the function a better fitting name.
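For illustration only (the new helper name is not spelled out above and is
assumed here), the caller's fast path could end up looking roughly like:
        /* Only branch into error handling when an error is actually pending. */
        if (unlikely(bp->b_error) && xfs_buf_ioend_handle_error(bp))
                return;   /* the error handler took ownership of the buffer */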
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Keep all the error handling code together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Merge xfs_buf_ioerror_retry into its only caller to make the resubmission
flow a little easier to follow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
xfs_buf_ioerror_fail_without_retry is a somewhat weird function in
that it has two trivial checks that decide the return value, while
the rest implements a ratelimited warning. Just lift the two checks
into the caller, and give the remainder a suitable name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
No need to keep a separate helper for this logic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Move the buffer retry state machine logic to xfs_buf.c and call it once
from xfs_ioend instead of duplicating it three times for the three kinds
of buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Move the log recovery I/O completion handling entirely into the log
recovery code, and re-arrange the normal I/O completion handler flow
to prepare for lifting more logic into common code in the next commits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Handle the no-error case in xfs_buf_iodone_error as well. To clarify the
code, rename the function, use the actual enum type as the return value,
and then switch on it in the callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.
This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).
Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:
call_usermodehelper()
  kernel_execve()
    old_mm = current->mm;
    active_mm = current->active_mm;
    *** preempt *** -------------------->          schedule()
                                                     prev->active_mm = NULL;
                                                     mmdrop(prev active_mm);
                                                   ...
                    <--------------------          schedule()
    current->mm = mm;
    current->active_mm = mm;
    if (!old_mm)
        mmdrop(active_mm);
If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.
Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().
So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.
This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.
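As a rough sketch of the resulting exec-time switch (the arch option is
assumed to be named CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM here), the update
could look like:
        local_irq_disable();
        active_mm = tsk->active_mm;
        tsk->active_mm = mm;
        tsk->mm = mm;
        /*
         * Architectures that can run activate_mm() with irqs off keep them
         * disabled to also close the TLB shootdown race; the others only
         * get the lazy tlb preempt race closed.
         */
        if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
                local_irq_enable();
        activate_mm(active_mm, mm);
        if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
                local_irq_enable();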
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
|
|
NVMe Zoned Namespace introduced the concept of active zones, which are
zones in the implicit open, explicit open or closed condition. Drives may
have a limit on the number of zones that can be simultaneously active.
This potential limitation translates into a risk for applications to see
write I/O errors due to this limit if the zone of a file being written to is
not already active when a write request is issued.
To avoid these potential errors, the zone of a file can explicitly be made
active using an open zone command when the file is opened for the first
time. If the zone open command succeeds, the application is then
guaranteed that write requests can be processed. This indirect management
of active zones relies on the maximum number of open zones of a drive,
which is always lower than or equal to the maximum number of active zones.
On the first open of a sequential zone file, send a REQ_OP_ZONE_OPEN
command to the block device. Conversely, on the last release of a zone
file, send a REQ_OP_ZONE_CLOSE command to the device if the zone is neither
full nor empty.
As truncating a zone file to 0 or max can deactivate a zone as well, we
need to serialize against truncates and also be careful not to close a
zone as the file may still be open for writing, e.g. the user called
ftruncate(). If the zone file is not open and a process does a truncate(),
then no close operation is needed.
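A minimal sketch of the first-open side, assuming a per-inode write-open
counter (called i_wr_refcnt here) and the zone management helper introduced
earlier in this series (called zonefs_zone_mgmt() here):
        static int zonefs_open_zone(struct inode *inode)
        {
                struct zonefs_inode_info *zi = ZONEFS_I(inode);
                int ret = 0;

                mutex_lock(&zi->i_truncate_mutex);
                zi->i_wr_refcnt++;
                /*
                 * Only the first writer explicitly opens the zone, and only
                 * if the zone is not already full.
                 */
                if (zi->i_wr_refcnt == 1 &&
                    i_size_read(inode) < zi->i_max_size) {
                        ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_OPEN);
                        if (ret)
                                zi->i_wr_refcnt--;
                }
                mutex_unlock(&zi->i_truncate_mutex);
                return ret;
        }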
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
|
|
Subsequent patches need to call zonefs_io_error() with the i_truncate_mutex
already held, so factor out the body of zonefs_io_error() into
__zonefs_io_error(), which can be called with the i_truncate_mutex held.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
|
|
Introduce a helper function for sending zone management commands to the
block device.
As zone management commands can change a zone write pointer position
reflected in the size of the zone file, this function expects the truncate
mutex to be held.
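A sketch of what such a helper could look like, assuming it wraps
blkdev_zone_mgmt() and asserts the locking rule described above (names are
illustrative):
        static int zonefs_zone_mgmt(struct inode *inode, enum req_opf op)
        {
                struct zonefs_inode_info *zi = ZONEFS_I(inode);

                /*
                 * Zone management commands may change the zone write
                 * pointer, which is reflected in the zone file size, so the
                 * truncate mutex must be held by the caller.
                 */
                lockdep_assert_held(&zi->i_truncate_mutex);

                return blkdev_zone_mgmt(inode->i_sb->s_bdev, op,
                                        zi->i_zsector,
                                        zi->i_zone_size >> SECTOR_SHIFT,
                                        GFP_NOFS);
        }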
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One of the recent lockdep fixes introduced a bug that breaks the
search ioctl, which is used by some applications (bees, compsize). The
patch made it to stable trees so we need this fixup to make it work
again"
* tag 'for-5.9-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix wrong address when faulting in pages in the search ioctl
|
|
After commit 0b6d4ca04a86 ("f2fs: don't return vmalloc() memory from
f2fs_kmalloc()"), f2fs_k{m,z}alloc() will not return vmalloc()'ed
memory, so clean up to use kfree() instead of kvfree() to free
vmalloc()'ed memory.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This isn't safe, and isn't needed either. We are guaranteed that any
work we queue is on a live task (and will be run), or it goes to
our backup io-wq threads if the task is exiting.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If task_work ends up being marked for cancelation, we go through a
cancelation helper instead of the queue path. In converting task_work to
always hold a ctx reference, this path was missed. Make sure that
io_req_task_cancel() puts the reference that is being held against the
ctx.
Fixes: 6d816e088c35 ("io_uring: hold 'ctx' reference around task_work queue + execute")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When faulting in the pages for the user supplied buffer for the search
ioctl, we are passing only the base address of the buffer to the function
fault_in_pages_writeable(). This means that after the first iteration of
the while loop that searches for leaves, when we have a non-zero offset,
stored in 'sk_offset', we try to fault in a wrong page range.
So fix this by adding the offset in 'sk_offset' to the base address of the
user supplied buffer when calling fault_in_pages_writeable().
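In other words, the call is expected to become something along these lines
(using the names from the description above rather than quoting the patch
verbatim):
        /* Fault in the part of the buffer we are about to copy into. */
        ret = fault_in_pages_writeable(ubuf + sk_offset,
                                       *buf_size - sk_offset);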
Several users have reported that the applications compsize and bees have
started to operate incorrectly since commit a48b73eca4ceb9 ("btrfs: fix
potential deadlock in the search ioctl") was added to stable trees, and
these applications make heavy use of the search ioctls. This fixes their
issues.
Link: https://lore.kernel.org/linux-btrfs/632b888d-a3c3-b085-cdf5-f9bb61017d92@lechevalier.se/
Link: https://github.com/kilobyte/compsize/issues/34
Fixes: a48b73eca4ceb9 ("btrfs: fix potential deadlock in the search ioctl")
CC: stable@vger.kernel.org # 4.4+
Tested-by: A L <mail@lechevalier.se>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Fixes the following W=1 kernel build warning(s):
fs/ext2/balloc.c:203: warning: Excess function parameter 'rb_root' description in '__rsv_window_dump'
fs/ext2/balloc.c:294: warning: Excess function parameter 'rb_root' description in 'search_reserve_window'
fs/ext2/balloc.c:878: warning: Excess function parameter 'rsv' description in 'alloc_new_reservation'
Link: https://lore.kernel.org/r/20200911114036.60616-1-wanghai38@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Always grab the work environment for deferred links. The assumption that we
will always be running it from the task in question is false, as exiting
tasks may mean that we're deferring this one to a thread helper. And at
that point it's too late to grab the work environment.
Fixes: debb85f496c9 ("io_uring: factor out grab_env() from defer_prep()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are some small driver core and debugfs fixes for 5.9-rc5
Included in here are:
- firmware loader memory leak fix
- firmware loader testing fixes for non-EFI systems
- device link locking fixes found by lockdep
- kobject_del() bugfix that has been affecting some callers
- debugfs minor fix
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
test_firmware: Test platform fw loading on non-EFI systems
PM: <linux/device.h>: fix @em_pd kernel-doc warning
kobject: Drop unneeded conditional in __kobject_del()
driver core: Fix device_pm_lock() locking for device links
MAINTAINERS: Add the security document to SECURITY CONTACT
driver code: print symbolic error code
debugfs: Fix module state check condition
kobject: Restore old behaviour of kobject_del(NULL)
firmware_loader: fix memory leak for paged buffer
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes:
- regression fix for a crash after failed snapshot creation
- one more lockep fix: use nofs allocation when allocating missing
device
- fix reloc tree leak on degraded mount
- make some extent buffer alignment checks less strict to mount
filesystems created by btrfs-convert"
* tag 'for-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix NULL pointer dereference after failure to create snapshot
btrfs: free data reloc tree on failed mount
btrfs: require only sector size alignment for parent eb bytenr
btrfs: fix lockdep splat in add_missing_dev
|
|
Pull cifs fix from Steve French:
"A fix for lookup on DFS link when cifsacl or modefromsid is used"
* tag '5.9-rc4-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix DFS mount with cifsacl/modefromsid
|
|
Profiling the f2fs compression code, I've found that vmap() calls show
unexpected spikes in execution time in our test environment and are a
bottleneck of the f2fs decompression path. Changing these to
vm_map_ram(), we can improve f2fs decompression speed considerably.
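A rough sketch of the mapping change for a cluster's pages during
decompression, assuming the three-argument vm_map_ram() prototype of
current kernels (variable names are illustrative):
        /* Before: rbuf = vmap(rpages, cluster_size, VM_MAP, PAGE_KERNEL); */
        rbuf = vm_map_ram(rpages, cluster_size, NUMA_NO_NODE);
        if (!rbuf)
                return -ENOMEM;

        /* ... decompress into rbuf ... */

        vm_unmap_ram(rbuf, cluster_size);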
[Verification]
Android Pixel 3(ARM64, 6GB RAM, 128GB UFS)
Turned on only 0-3 little cores(at 1.785GHz)
dd if=/dev/zero of=dummy bs=1m count=1000
echo 3 > /proc/sys/vm/drop_caches
dd if=dummy of=/dev/zero bs=512k
- w/o compression -
1048576000 bytes (0.9 G) copied, 2.082554 s, 480 M/s
1048576000 bytes (0.9 G) copied, 2.081634 s, 480 M/s
1048576000 bytes (0.9 G) copied, 2.090861 s, 478 M/s
- before patch -
1048576000 bytes (0.9 G) copied, 7.407527 s, 135 M/s
1048576000 bytes (0.9 G) copied, 7.283734 s, 137 M/s
1048576000 bytes (0.9 G) copied, 7.291508 s, 137 M/s
- after patch -
1048576000 bytes (0.9 G) copied, 1.998959 s, 500 M/s
1048576000 bytes (0.9 G) copied, 1.987554 s, 503 M/s
1048576000 bytes (0.9 G) copied, 1.986380 s, 503 M/s
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The returned integer is not used anywhere, so change the return value to a
bool.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
writepages() can be concurrently invoked for the same file by different
threads, such as a thread fsyncing the file and a kworker kernel thread.
So, changing i_compr_blocks without protection is racy and we need to
protect it by making it an atomic value. Also, we don't need a 64-bit
value for i_compr_blocks, so use atomic_t rather than atomic64_t.
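For illustration, the field and its updates would change roughly as follows
(a sketch, not the exact diff):
        /* in struct f2fs_inode_info: previously a 64-bit counter */
        atomic_t i_compr_blocks;

        /* updates from concurrent writepages() threads are now race-free */
        atomic_add(blocks, &fi->i_compr_blocks);
        ...
        blocks = atomic_read(&fi->i_compr_blocks);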
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Keep consistent with the behavior when passing the compress mount option
to a kernel without the compression feature, so that mount does not fail
in that case.
Reported-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
As 5kft <5kft@5kft.org> reported:
kworker/u9:3: page allocation failure: order:9, mode:0x40c40(GFP_NOFS|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 3 PID: 8168 Comm: kworker/u9:3 Tainted: G C 5.8.3-sunxi #trunk
Hardware name: Allwinner sun8i Family
Workqueue: f2fs_post_read_wq f2fs_post_read_work
[<c010d6d5>] (unwind_backtrace) from [<c0109a55>] (show_stack+0x11/0x14)
[<c0109a55>] (show_stack) from [<c056d489>] (dump_stack+0x75/0x84)
[<c056d489>] (dump_stack) from [<c0243b53>] (warn_alloc+0xa3/0x104)
[<c0243b53>] (warn_alloc) from [<c024473b>] (__alloc_pages_nodemask+0xb87/0xc40)
[<c024473b>] (__alloc_pages_nodemask) from [<c02267c5>] (kmalloc_order+0x19/0x38)
[<c02267c5>] (kmalloc_order) from [<c02267fd>] (kmalloc_order_trace+0x19/0x90)
[<c02267fd>] (kmalloc_order_trace) from [<c047c665>] (zstd_init_decompress_ctx+0x21/0x88)
[<c047c665>] (zstd_init_decompress_ctx) from [<c047e9cf>] (f2fs_decompress_pages+0x97/0x228)
[<c047e9cf>] (f2fs_decompress_pages) from [<c045d0ab>] (__read_end_io+0xfb/0x130)
[<c045d0ab>] (__read_end_io) from [<c045d141>] (f2fs_post_read_work+0x61/0x84)
[<c045d141>] (f2fs_post_read_work) from [<c0130b2f>] (process_one_work+0x15f/0x3b0)
[<c0130b2f>] (process_one_work) from [<c0130e7b>] (worker_thread+0xfb/0x3e0)
[<c0130e7b>] (worker_thread) from [<c0135c3b>] (kthread+0xeb/0x10c)
[<c0135c3b>] (kthread) from [<c0100159>]
zstd may allocate a large amount of memory for {,de}compression, which may
cause file copy failures on low-end devices with very little memory.
For decompression, let's just allocate memory of a proper size based on the
current file's cluster size instead of the max cluster size.
Reported-by: 5kft <5kft@5kft.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The compr_blocks field of the superblock info is currently not a 64-bit
value. We accumulate each inode's i_compr_blocks count, which is a 64-bit
value, into it, so it needs to be changed to a 64-bit value as well.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Add a block address range check for the compressed file case and avoid
calling get_data_block_bmap() for compressed files.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When the move range ioctl is used, check the input and output positions and
ensure that they are non-negative values. Without this check
f2fs_get_dnode_of_data may hit a memory bug.
Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
APP_DIRECT_IO/APP_DIRECT_READ_IO were not updated when receiving async DIO.
For example: fio -filename=/data/test.0 -bs=1m -ioengine=libaio -direct=1
-name=fill -size=10m -numjobs=1 -iodepth=32 -rw=write
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Instead of finding the first dirty page and then seeing if it matches
the index of a block that is NEW_ADDR, delay the lookup of the dirty
bit until we've actually found a block that's NEW_ADDR.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
There are several issues in the current background GC algorithm:
- the valid block count is one of the key factors in the cost overhead
calculation, so if a segment has few valid blocks, the CB algorithm will
still choose it as a victim even if the segment is young or located in a
hot area, which is not appropriate.
- GCed data/node blocks go to the existing logs no matter whether the
update frequency of the data in them is the same or not, so hot and cold
data may be mixed again.
- the GC allocator mainly uses LFS-type segments, which consumes free
segments more quickly.
This patch introduces a new algorithm named age-threshold-based
garbage collection to solve the above issues. It has three main steps:
1. select a source victim:
- set an age threshold and select candidates based on that threshold:
e.g.
0 means youngest and 100 means oldest; if we set the age threshold to 80,
then select dirty segments whose age is in the range [80, 100] as
candidates (see the sketch after this list);
- set a candidate_ratio threshold and select candidates based on that
ratio, so that we can shrink the candidates to the oldest segments;
- select the segment with the fewest valid blocks in order to
migrate blocks with minimum cost;
2. select a target victim:
- select candidates based on the age threshold;
- set a candidate_radius threshold and search for candidates whose age is
around the source victim's; the search radius should be less than the
radius threshold;
- select the segment with the most valid blocks in order to avoid
migrating the current target segment.
3. merge valid blocks from the source victim into the target victim with
the SSR allocator.
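A simplified sketch of the source-victim selection in step 1 (a
hypothetical helper, not the actual f2fs code; the dirty-segment iterator
and context structure are made up for illustration):
        /* Normalize a segment's mtime to an age in [0, 100]; 100 = oldest. */
        static unsigned int seg_age(u64 mtime, u64 min_mtime, u64 max_mtime)
        {
                if (max_mtime <= min_mtime)
                        return 0;
                return div64_u64(100 * (max_mtime - mtime),
                                 max_mtime - min_mtime);
        }

        /* Step 1: among old-enough candidates, pick the fewest valid blocks. */
        static unsigned int select_source_victim(struct victim_ctx *ctx)
        {
                unsigned int segno, victim = NULL_SEGNO;
                unsigned int min_vblocks = UINT_MAX, candidates = 0;

                for_each_dirty_segment(ctx, segno) {
                        if (seg_age(ctx->mtime[segno], ctx->min_mtime,
                                    ctx->max_mtime) < ctx->age_threshold)
                                continue;               /* too young */
                        if (++candidates > ctx->max_candidates)
                                break;                  /* candidate_ratio cap */
                        if (ctx->valid_blocks[segno] < min_vblocks) {
                                min_vblocks = ctx->valid_blocks[segno];
                                victim = segno;
                        }
                }
                return victim;
        }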
Test steps:
- create 160 dirty segments:
* half of them have 128 valid blocks per segment
* left of them have 384 valid blocks per segment
- run background GC
Benefit: GC count and block movement count both decrease noticeably:
- Before:
- Valid: 86
- Dirty: 1
- Prefree: 11
- Free: 6001 (6001)
GC calls: 162 (BG: 220)
- data segments : 160 (160)
- node segments : 2 (2)
Try to move 41454 blocks (BG: 41454)
- data blocks : 40960 (40960)
- node blocks : 494 (494)
IPU: 0 blocks
SSR: 0 blocks in 0 segments
LFS: 41364 blocks in 81 segments
- After:
- Valid: 87
- Dirty: 0
- Prefree: 4
- Free: 6008 (6008)
GC calls: 75 (BG: 76)
- data segments : 74 (74)
- node segments : 1 (1)
Try to move 12813 blocks (BG: 12813)
- data blocks : 12544 (12544)
- node blocks : 269 (269)
IPU: 0 blocks
SSR: 12032 blocks in 77 segments
LFS: 855 blocks in 2 segments
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Checking for the lack of epitems referring to the epoll we want to insert into
is not enough; we might have an insertion of that epoll into another one that
has already collected the set of files to recheck for excessive reverse paths,
but hasn't gotten to creating/inserting the epitem for it.
However, any such insertion in progress can be detected - it will update the
generation count in our epoll when it's done looking through it for files
to check. That gets done under ->mtx of our epoll and that allows us to
detect that safely.
We are *not* holding epmutex here, so the generation count is not stable.
However, since both the update of ep->gen by loop check and (later)
insertion into ->f_ep_link are done with ep->mtx held, we are fine -
the sequence is
grab epmutex
bump loop_check_gen
...
grab tep->mtx // 1
tep->gen = loop_check_gen
...
drop tep->mtx // 2
...
grab tep->mtx // 3
...
insert into ->f_ep_link
...
drop tep->mtx // 4
bump loop_check_gen
drop epmutex
and if the fastpath check in another thread happens for that
eventpoll, it can come
* before (1) - in that case fastpath is just fine
* after (4) - we'll see non-empty ->f_ep_link, slow path
taken
* between (2) and (3) - loop_check_gen is stable,
with ->mtx providing barriers and we end up taking slow path.
Note that ->f_ep_link emptiness check is slightly racy - we are protected
against insertions into that list, but removals can happen right under us.
Not a problem - in the worst case we'll end up taking a slow path for
no good reason.
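Put together, the fastpath decision made under ep->mtx might look roughly
like this (a sketch, not the verbatim patch):
        if (op == EPOLL_CTL_ADD) {
                /*
                 * Take the slow path (full loop / reverse-path check) if
                 * this epoll is already watched by someone (non-empty
                 * f_ep_links), if a loop check that visited it is still in
                 * flight (ep->gen == loop_check_gen), or if we are
                 * inserting another epoll into it.
                 */
                if (!list_empty(&f.file->f_ep_links) ||
                    ep->gen == loop_check_gen ||
                    is_file_epoll(tf.file))
                        full_check = 1;
        }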
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
This switches f2fs over to the generic support provided in
the previous patch.
Since casefolded dentries behave the same in ext4 and f2fs, we decrease
the maintenance burden by unifying them, and any optimizations will
immediately apply to both.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This adds general supporting functions for filesystems that use
utf8 casefolding. It provides standard dentry_operations and adds the
necessary structures in struct super_block to allow this standardization.
The new dentry operations are functionally equivalent to the existing
operations in ext4 and f2fs, apart from the use of utf8_casefold_hash to
avoid an allocation.
By providing a common implementation, all users can benefit from any
optimizations without needing to port over improvements.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This adds a case insensitive hash function to allow taking the hash
without needing to allocate a casefolded copy of the string.
The existing d_hash implementations for casefolding allocate memory
within rcu-walk; by avoiding that we can be more efficient and avoid
worrying about a failed allocation.
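The idea is to fold and hash one code point at a time instead of building a
casefolded copy first; a rough sketch, with a made-up cursor API standing
in for the unicode layer's internal iteration:
        static int casefold_hash_sketch(const struct unicode_map *um,
                                        const void *salt, struct qstr *str)
        {
                unsigned long hash = init_name_hash(salt);
                struct utf8_fold_cursor cur;    /* hypothetical cursor type */
                int c;

                if (utf8_fold_cursor_init(&cur, um, str->name, str->len))
                        return -EINVAL;
                /* Feed each casefolded byte into the hash, no allocation. */
                while ((c = utf8_fold_next_byte(&cur)) > 0)
                        hash = partial_name_hash((unsigned char)c, hash);
                if (c < 0)
                        return -EINVAL;         /* invalid sequence */
                str->hash = end_name_hash(hash);
                return 0;
        }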
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
A refcount_t variable should never be less than one, so it's a little bit
hard to understand when we use it to indicate the pending compressed page
count. Let's change it to atomic_t for better readability.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch fixes the below compile warning reported by LKP
(kernel test robot):
cppcheck warnings: (new ones prefixed by >>)
>> fs/f2fs/file.c:761:9: warning: Identical condition 'err', second condition is always false [identicalConditionAfterEarlyExit]
return err;
^
fs/f2fs/file.c:753:6: note: first condition
if (err)
^
fs/f2fs/file.c:761:9: note: second condition
return err;
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Then we can add the specified entry into the rb-tree with the 64-bit
segment time as the key.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Don't let f2fs' internal GC ruin the original aging degree of a segment.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Previously, once we updated one block in a segment, we set the segment's
mtime to the time of that update, making an aged segment look freshest and
causing GC with the cost-benefit algorithm to miss such segments. So this
patch changes mtime to record the average block update time instead of the
last update time.
There is no need to reset mtime for a prefree segment: se->valid_blocks is
zero, so the old se->mtime won't carry any weight in the calculation below:
se->mtime = div_u64(se->mtime * se->valid_blocks + mtime,
se->valid_blocks + 1);
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The previous implementation of aligned pinfile allocation would:
- allocate a new segment on the cold data log no matter whether the last
used segment is partially used or not, which makes I/O more random;
- force concurrent cold data/GCed I/O into the warm data area, which
can hurt hot/cold data separation.
In this patch, we introduce a new type of log named 'inmem curseg'. The
differences from a normal curseg are:
- it reuses existing segment types (CURSEG_XXX_NODE/DATA);
- it only exists in memory; its segno, blkofs and summary will not be
persisted into the checkpoint area.
With this new feature, we can enhance the scalability of logs; special
allocators can be created for specific purposes:
- a pure LFS allocator for aligned pinfile allocation or file
defragmentation
- a pure SSR allocator for later features
So, let's update aligned pinfile allocation to use this new inmem curseg
framework.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|