Age | Commit message (Collapse) | Author |
|
dax_clear_blocks() needs a valid struct block_device and previously it
was using inode->i_sb->s_bdev in all cases. This is correct for normal
inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
DAX raw block devices and for XFS real-time devices.
Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
its arguments to take a bdev and a sector instead of an inode and a
block. This better reflects what the function does, and it allows the
filesystem and raw block device code to pass in an appropriate struct
block_device.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Online defrag operations for ext4 are hard coded to use the page cache.
See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()
When combined with DAX I/O, which circumvents the page cache, this can
result in data corruption. This was observed with xfstests ext4/307 and
ext4/308.
Fix this by only allowing online defrag for non-DAX files.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When S_DAX is set on an inode we assume that if there are pages attached
to the mapping (mapping->nrpages != 0), those pages are clean zero pages
that were used to service reads from holes. Any dirty data associated
with the inode should be in the form of DAX exceptional entries
(mapping->nrexceptional) that is written back via
dax_writeback_mapping_range().
With the current code, though, this isn't always true. For example,
ext2 and ext4 directory inodes can have S_DAX set, but have their dirty
data stored as dirty page cache entries. For these types of inodes,
having S_DAX set doesn't really make sense since their I/O doesn't
actually happen through the DAX code path.
Instead, only allow S_DAX to be set for regular inodes for ext2 and
ext4. This allows us to have strict DAX vs non-DAX paths in the
writeback code.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The recent *sync enabling discovered that we are inserting into the
block_device pagecache counter to the expectations of the dirty data
tracking for dax mappings. This can lead to data corruption.
We want to support DAX for block devices eventually, but it requires
wider changes to properly manage the pagecache.
dump_stack+0x85/0xc2
dax_writeback_mapping_range+0x60/0xe0
blkdev_writepages+0x3f/0x50
do_writepages+0x21/0x30
__filemap_fdatawrite_range+0xc6/0x100
filemap_write_and_wait+0x4a/0xa0
set_blocksize+0x70/0xd0
sb_set_blocksize+0x1d/0x50
ext4_fill_super+0x75b/0x3360
mount_bdev+0x180/0x1b0
ext4_mount+0x15/0x20
mount_fs+0x38/0x170
Mark the support broken so its disabled by default, but otherwise still
available for testing.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When doing append direct io cleanup, if deleting inode fails, it goes
out without unlocking inode, which will cause the inode deadlock.
This issue was introduced by commit cf1776a9e834 ("ocfs2: fix a tiny
race when truncate dio orohaned entry").
Signed-off-by: Guozhonghua <guozhonghua@h3c.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Replace calls to get_random_int() followed by a cast to (unsigned long)
with calls to get_random_long(). Also address shifting bug which, in
case of x86 removed entropy mask for mmap_rnd_bits values > 31 bits.
Signed-off-by: Daniel Cashman <dcashman@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When testing with fsstress, kworker and user threads were both blocked:
INFO: task kworker/u16:1:16580 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:1 D ffff8803f2595390 0 16580 2 0x00000000
Workqueue: writeback bdi_writeback_workfn (flush-251:0)
ffff8802730e5760 0000000000000046 ffff880274729fc0 0000000000012440
ffff8802730e5fd8 ffff8802730e4010 0000000000012440 0000000000012440
ffff8802730e5fd8 0000000000012440 ffff880274729fc0 ffff88026eb50000
Call Trace:
[<ffffffff816fe9d9>] schedule+0x29/0x70
[<ffffffff816ff895>] rwsem_down_read_failed+0xa5/0xf9
[<ffffffff81378584>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffa0694feb>] f2fs_write_data_page+0x31b/0x420 [f2fs]
[<ffffffffa0690f1a>] __f2fs_writepage+0x1a/0x50 [f2fs]
[<ffffffffa06922a0>] f2fs_write_data_pages+0xe0/0x290 [f2fs]
[<ffffffff811473b3>] do_writepages+0x23/0x40
[<ffffffff811cc3ee>] __writeback_single_inode+0x4e/0x250
[<ffffffff811cd4f1>] writeback_sb_inodes+0x2c1/0x470
[<ffffffff811cd73e>] __writeback_inodes_wb+0x9e/0xd0
[<ffffffff811cda0b>] wb_writeback+0x1fb/0x2d0
[<ffffffff811cdb7c>] wb_do_writeback+0x9c/0x220
[<ffffffff811ce232>] bdi_writeback_workfn+0x72/0x1c0
[<ffffffff8106b74e>] process_one_work+0x1de/0x5b0
[<ffffffff8106e78f>] worker_thread+0x11f/0x3e0
[<ffffffff810750ce>] kthread+0xde/0xf0
[<ffffffff817093f8>] ret_from_fork+0x58/0x90
fsstress thread stack:
[<ffffffff81139f0e>] sleep_on_page+0xe/0x20
[<ffffffff81139ef7>] __lock_page+0x67/0x70
[<ffffffff8113b100>] find_lock_page+0x50/0x80
[<ffffffff8113b24f>] find_or_create_page+0x3f/0xb0
[<ffffffffa06983a9>] sync_node_pages+0x259/0x810 [f2fs]
[<ffffffffa068d874>] write_checkpoint+0x1a4/0xce0 [f2fs]
[<ffffffffa0686b0c>] f2fs_sync_fs+0x7c/0xd0 [f2fs]
[<ffffffffa067c813>] f2fs_sync_file+0x143/0x5f0 [f2fs]
[<ffffffff811d301b>] vfs_fsync_range+0x2b/0x40
[<ffffffff811d304c>] vfs_fsync+0x1c/0x20
[<ffffffff811d3291>] do_fsync+0x41/0x70
[<ffffffff811d32d3>] SyS_fdatasync+0x13/0x20
[<ffffffff817094a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
The reason of this issue is:
CPU0: CPU1:
- f2fs_write_data_pages
- f2fs_sync_fs
- write_checkpoint
- block_operations
- f2fs_lock_all
- down_write(sbi->cp_rwsem)
- lock_page(page)
- f2fs_write_data_page
- sync_node_pages
- flush_inline_data
- pagecache_get_page(page, GFP_LOCK)
- f2fs_lock_op
- down_read(sbi->cp_rwsem)
This patch alters to use trylock_page in flush_inline_data to fix this ABBA
deadlock issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Add a new helper f2fs_flush_merged_bios to clean up redundant codes.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Add a new help f2fs_update_data_blkaddr to clean up redundant codes.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
For now, flow of GCing an encrypted data page:
1) try to grab meta page in meta inode's mapping with index of old block
address of that data page
2) load data of ciphertext into meta page
3) allocate new block address
4) write the meta page into new block address
5) update block address pointer in direct node page.
Other reader/writer will use f2fs_wait_on_encrypted_page_writeback to
check and wait on GCed encrypted data cached in meta page writebacked
in order to avoid inconsistence among data page cache, meta page cache
and data on-disk when updating.
However, we will use new block address updated in step 5) as an index to
lookup meta page in inner bio buffer. That would be wrong, and we will
never find the GCing meta page, since we use the old block address as
index of that page in step 1).
This patch fixes the issue by adjust the order of step 1) and step 3),
and in step 1) grab page with index generated in step 3).
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Al Viro has cleaned up the way ops are processed and waited for,
now orangefs.txt has an overview of how it works. Several recent
related commits have added to the comments in the code as well.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
orangefs contains a helper function to calculate the difference
between two timeval structures. We are trying to remove all
instances of timespec from the kernel, and this one is not
used at all, so let's remove it now.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
The new orangefs code uses a helper function to read a time field to
its private structures from struct iattr. This will conflict with the
move to 64-bit timestamps in the kernel and is generally not necessary.
This replaces the conversion with a simple cast to time64_t that shows
what is going on. As the orangefs-internal representation already uses
64-bit timestamps, there should be no ambiguity to negative values,
and the cast ensures that we treat them as times before 1970 on both
32-bit and 64-bit architectures, rather than times after 2038. This
patch keeps that behavior.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
|
|
# Conflicts:
# fs/btrfs/file.c
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# Conflicts:
# fs/btrfs/file.c
|
|
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_fs_time() instead.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
1. Inode mapping tree can index page in range of [0, ULONG_MAX], however,
in some places, f2fs only search or iterate page in ragne of [0, LONG_MAX],
result in miss hitting in page cache.
2. filemap_fdatawait_range accepts range parameters in unit of bytes, so
the max range it covers should be [0, LLONG_MAX], if we use [0, LONG_MAX]
as range for waiting on writeback, big number of pages will not be covered.
This patch corrects above two issues.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When a directory is deleted, we don't take too much care about killing off
all the dirents that belong to it — on the basis that on remount, the scan
will conclude that the directory is dead anyway.
This doesn't work though, when the deleted directory contained a child
directory which was moved *out*. In the early stages of the fs build
we can then end up with an apparent hard link, with the child directory
appearing both in its true location, and as a child of the original
directory which are this stage of the mount process we don't *yet* know
is defunct.
To resolve this, take out the early special-casing of the "directories
shall not have hard links" rule in jffs2_build_inode_pass1(), and let the
normal nlink processing happen for directories as well as other inodes.
Then later in the build process we can set ic->pino_nlink to the parent
inode#, as is required for directories during normal operaton, instead
of the nlink. And complain only *then* about hard links which are still
in evidence even after killing off all the unreachable paths.
Reported-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@vger.kernel.org
|
|
With this fix, all code paths should now be obtaining the page lock before
f->sem.
Reported-by: Szabó Tamás <sztomi89@gmail.com>
Tested-by: Thomas Betker <thomas.betker@rohde-schwarz.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@vger.kernel.org
|
|
This reverts commit 5ffd3412ae55
("jffs2: Fix lock acquisition order bug in jffs2_write_begin").
The commit modified jffs2_write_begin() to remove a deadlock with
jffs2_garbage_collect_live(), but this introduced new deadlocks found
by multiple users. page_lock() actually has to be called before
mutex_lock(&c->alloc_sem) or mutex_lock(&f->sem) because
jffs2_write_end() and jffs2_readpage() are called with the page locked,
and they acquire c->alloc_sem and f->sem, resp.
In other words, the lock order in jffs2_write_begin() was correct, and
it is the jffs2_garbage_collect_live() path that has to be changed.
Revert the commit to get rid of the new deadlocks, and to clear the way
for a better fix of the original deadlock.
Reported-by: Deng Chao <deng.chao1@zte.com.cn>
Reported-by: Ming Liu <liu.ming50@gmail.com>
Reported-by: wangzaiwei <wangzaiwei@top-vision.cn>
Signed-off-by: Thomas Betker <thomas.betker@rohde-schwarz.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@vger.kernel.org
|
|
Size and type are read-only and not in the mask. The times were left
unset despite being in the mask.
We zero-fill the times since the server will fill them in and we will
get the correct time when we fill the inode with getattr.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
I have verified that there is nothing in the userspace daemon version we
are implementing this protocol against that ever looks at this field.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
We only need it while the service operation is actually in progress
since it is only used to co-ordinate the client-core's memory use. The
kernel allocates its own space.
Also clean up some comments which mislead the reader into thinking
the readdir buffers are shared memory.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"Assorted fixes - xattr one from this cycle, the rest - stable fodder"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/pnode.c: treat zero mnt_group_id-s as unequal
affs_do_readpage_ofs(): just use kmap_atomic() around memcpy()
xattr handlers: plug a lock leak in simple_xattr_list
fs: allow no_seek_end_llseek to actually seek
|
|
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Pull NFS client bugfixes from Trond Myklebust:
"Stable bugfixes:
- Fix nfs_size_to_loff_t
- NFSv4: Fix a dentry leak on alias use
Other bugfixes:
- Don't schedule a layoutreturn if the layout segment can be freed
immediately.
- Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode
- rpcrdma_bc_receive_call() should init rq_private_buf.len
- fix stateid handling for the NFS v4.2 operations
- pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page
- fix panic in gss_pipe_downcall() in fips mode
- Fix a race between layoutget and pnfs_destroy_layout
- Fix a race between layoutget and bulk recalls"
* tag 'nfs-for-4.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4.x/pnfs: Fix a race between layoutget and bulk recalls
NFSv4.x/pnfs: Fix a race between layoutget and pnfs_destroy_layout
auth_gss: fix panic in gss_pipe_downcall() in fips mode
pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page
nfs4: fix stateid handling for the NFS v4.2 operations
NFSv4: Fix a dentry leak on alias use
xprtrdma: rpcrdma_bc_receive_call() should init rq_private_buf.len
pNFS: Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode
pNFS: Fix pnfs_mark_matching_lsegs_return()
nfs: fix nfs_size_to_loff_t
|
|
The D state of wait_on_all_pages_writeback should be waken by
function f2fs_write_end_io when all writeback pages have been
succesfully written to device. It's possible that wake_up comes
between get_pages and io_schedule. Maybe in this case it will
lost wake_up and still in D state even if all pages have been
write back to device, and finally, the whole system will be into
the hungtask state.
if (!get_pages(sbi, F2FS_WRITEBACK))
break;
<--------- wake_up
io_schedule();
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Biao He <hebiao6@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Xfstests btrfs/011 complains about a deadlock warning,
[ 1226.649039] =========================================================
[ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
[ 1226.649039] 4.1.0+ #270 Not tainted
[ 1226.649039] ---------------------------------------------------------
[ 1226.652955] kswapd0/46 just changed the state of lock:
[ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.}
and interrupts could create inverse lock ordering between them.
[ 1226.652955]
other info that might help us debug this:
[ 1226.652955] Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
[ 1226.652955] Possible interrupt unsafe locking scenario:
[ 1226.652955] CPU0 CPU1
[ 1226.652955] ---- ----
[ 1226.652955] lock(&fs_info->dev_replace.lock);
[ 1226.652955] local_irq_disable();
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955] lock(&found->groups_sem);
[ 1226.652955] <Interrupt>
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955]
*** DEADLOCK ***
Commit 084b6e7c7607 ("btrfs: Fix a lockdep warning when running xfstest.") tried
to fix a similar one that has the exactly same warning, but with that, we still
run to this.
The above lock chain comes from
btrfs_commit_transaction
->btrfs_run_delayed_items
...
->__btrfs_update_delayed_inode
...
->__btrfs_cow_block
...
->find_free_extent
->cache_block_group
->load_free_space_cache
->btrfs_readpages
->submit_one_bio
...
->__btrfs_map_block
->btrfs_dev_replace_lock
However, with high memory pressure, tasks which hold dev_replace.lock can
be interrupted by kswapd and then kswapd is intended to release memory occupied
by superblock, inodes and dentries, where we may call evict_inode, and it comes
to
[ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
[ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
to a ABBA deadlock.
To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but
things are simpler here since we only needs read's spinlock to blocking lock.
With this, btrfs/011 no more produces warnings in dmesg.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The control device is accessible when no filesystem is mounted and we
may want to query features supported by the module. This is already
possible using the sysfs files, this ioctl is for parity and
convenience.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The current practical default is ~4k on x86_64 (the logic is more complex,
simplified for brevity), the inlined files land in the metadata group and
thus consume space that could be needed for the real metadata.
The inlining brings some usability surprises:
1) total space consumption measured on various filesystems and btrfs
with DUP metadata was quite visible because of the duplicated data
within metadata
2) inlined data may exhaust the metadata, which are more precious in case
the entire device space is allocated to chunks (ie. balance cannot
make the space more compact)
3) performance suffers a bit as the inlined blocks are duplicate and
stored far away on the device.
Proposed fix: set the default to 2048
This fixes namely 1), the total filesysystem space consumption will be on
par with other filesystems.
Partially fixes 2), more data are pushed to the data block groups.
The characteristics of 3) are based on actual small file size
distribution.
The change is independent of the metadata blockgroup type (though it's
most visible with DUP) or system page size as these parameters are not
trival to find out, compared to file size.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Let's remove the error message that appears when the tree_id is not
present. This can happen with the quota tree and has been observed in
practice. The applications are supposed to handle -ENOENT and we don't
need to report that in the system log as it's not a fatal error.
Reported-by: Vlastimil Babka <vbabka@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
With CONFIG_SMP and CONFIG_PREEMPT both disabled, gcc decides
to partially inline the get_state_failrec() function but cannot
figure out that means the failrec pointer is always valid
if the function returns success, which causes a harmless
warning:
fs/btrfs/extent_io.c: In function 'clean_io_failure':
fs/btrfs/extent_io.c:2131:4: error: 'failrec' may be used uninitialized in this function [-Werror=maybe-uninitialized]
This marks get_state_failrec() and set_state_failrec() both
as 'noinline', which avoids the warning in all cases for me,
and seems less ugly than adding a fake initialization.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 47dc196ae719 ("btrfs: use proper type for failrec in extent_state")
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This patch enables to trace old block address of CoWed page for better
debugging.
f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4f0, oldaddr = 0xfe8ab, newaddr = 0xfee90 rw = WRITE_SYNC, type = NODE
f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4f8, oldaddr = 0xfe8b0, newaddr = 0xfee91 rw = WRITE_SYNC, type = NODE
f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4fa, oldaddr = 0xfe8ae, newaddr = 0xfee92 rw = WRITE_SYNC, type = NODE
f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x96, oldaddr = 0xf049b, newaddr = 0x2bbe rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x97, oldaddr = 0xf049c, newaddr = 0x2bbf rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x98, oldaddr = 0xf049d, newaddr = 0x2bc0 rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x47, oldaddr = 0xffffffff, newaddr = 0xf2631 rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x48, oldaddr = 0xffffffff, newaddr = 0xf2632 rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x49, oldaddr = 0xffffffff, newaddr = 0xf2633 rw = WRITE, type = DATA
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When flushing node pages, if current node page is an inline inode page, we
will try to merge inline data from data page into inline inode page, then
skip flushing current node page, it will decrease the number of nodes to
be flushed in batch in this round, which may lead to worse performance.
This patch gives a chance to flush just merged inline inode pages for
performance.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch changes to show more info in message log about the recovery
of the corrupted superblock during ->mount, e.g. the index of corrupted
superblock and the result of recovery.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
With a partition which was formated as multi segments in one section,
we stated incorrectly for count of gc operation.
e.g., for a partition with segs_per_sec = 4
cat /sys/kernel/debug/f2fs/status
GC calls: 208 (BG: 7)
- data segments : 104 (52)
- node segments : 104 (24)
GC called count should be (104 (data segs) + 104 (node segs)) / 4 = 52,
rather than 208. Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|