Age | Commit message (Collapse) | Author |
|
rq_timed_out_fn might have been unset while the request
was in flight, so we need to check for it in blk_rq_timed_out().
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
The DASD driver is using FASTFAIL as an equivalent to the
transport errors in SCSI. And the 'steal lock' function maps
roughly to a reservation error. So we should be returning the
appropriate error codes when completing a request.
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
Whenever a request has been aborted internally by the driver
there is no sense data to be had. And printing lots of messages
stalls the system, so better to print out a short one-liner.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
This patch implements generic block layer timeout handling
callbacks for DASDs. When the timeout expires the respective
cqr is aborted.
With this timeout handler time-critical request abort
is guaranteed as the abort does not depend on the internal
state of the various DASD driver queues.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Acked-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
Originally the DASD device tasklet would process the entries on
the ccw_queue until the first non-final request was found.
Which was okay as long as all requests have the same retries and
expires parameter.
However, as we're now allowing to modify both it is possible to
have requests _after_ the first request which already have expired.
So we need to check all requests in the device tasklet.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
Instead of having the number of retries hard-coded in the various
functions we should be using a default retry value, which can
be modified via sysfs.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
dasd_cancel_req will never return 1, only 0.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
There is a misleading naming of the program parameter fields, so
correct them according to their names as outlined in
"The Load-Program-Parameter and the CPU-measurement Facilities"
(SA23-2260-03).
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
This has plagued us forever and I'm so over working around it. When we truncate
down to a non-page aligned offset we will call btrfs_truncate_page to zero out
the end of the page and write it back to disk, this will keep us from exposing
stale data if we truncate back up from that point. The problem with this is it
requires data space to do this, and people don't really expect to get ENOSPC
from truncate() for these sort of things. This also tends to bite the orphan
cleanup stuff too which keeps people from mounting. To get around this we can
just move this into btrfs_cont_expand() to make sure if we are truncating up
from a non-page size aligned i_size we will zero out the rest of this page so
that we don't expose stale data. This will give ENOSPC if you try to truncate()
up or if you try to write past the end of isize, which is much more reasonable.
This fixes xfstests generic/083 failing to mount because of the orphan cleanup
failing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
This patch does two things. First we no longer explicitly read in the blocks
we're trying to readahead. For things like balance_level we may never actually
use the blocks so this just adds uneeded latency, and balance_level and
split_node will both read in the blocks they care about explicitly so if the
blocks need to be waited on it will be done there. Secondly we no longer drop
the path if we do readahead, we just set the path blocking before we call
reada_for_balance() and then we're good to go. Hopefully this will cut down on
the number of re-searches. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
This patch does two things, first it only does one call to
btrfs_buffer_uptodate() with the gen specified instead of once with 0 and then
again with gen specified. The other thing is to call btrfs_read_buffer() on the
buffer we've found instead of dropping it and then calling read_tree_block().
This will keep us from doing yet another radix tree lookup for a buffer we've
already found. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
A user reported a deadlock where the async submit thread was blocked on the
lock_extent() lock, and then everybody behind him was locked on the page lock
for the page he was holding. Looking at the code I noticed we do not unlock the
extent range when we get ENOSPC and goto retry. This is bad because we
immediately try to lock that range again to do the cow, which will cause a
deadlock. Fix this by unlocking the range. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
The comment is for btrfs_attach_transaction_barrier, not for
btrfs_attach_transaction. Fix the typo.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Acked-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
We unconditionally search for the EXTENT_ITEM_KEY for metadata during balance,
and then check the key that we found to see if it is actually a
METADATA_ITEM_KEY, but this doesn't work right because METADATA is a higher key
value, so if what we are looking for happens to be the first item in the leaf
the search will dump us out at the previous leaf, and we won't find our item.
So instead do what we do everywhere else, search for the skinny extent first and
if we don't find it go back and re-search for the extent item. This patch fixes
the panic I was hitting when balancing a large file system with skinny extents.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
Looking into this backref problem I noticed we're using a macro to what turns
out to essentially be a NULL check to see if we need to search the commit root.
I'm killing this, let's just do what everybody else does and checks if trans ==
NULL. I've also made it so we pass in the path to __resolve_indirect_refs which
will have the search_commit_root flag set properly already and that way we can
avoid allocating another path when we have a perfectly good one to use. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
A user reported scrub taking up an unreasonable amount of ram as it ran. This
is because we lookup the csums for the extent we're scrubbing but don't free it
up until after we're done with the scrub, which means we can take up a whole lot
of ram. This patch fixes this by dropping the csums once we're done with the
extent we've scrubbed. The user reported this to fix their problem. Thanks,
Reported-and-tested-by: Remco Hosman <remco@hosman.xs4all.nl>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
Dave has this fs_mark script that can make btrfs abort with sufficient amount of
ram. This is because with more ram we can keep more dirty metadata in cache
which in a round about way makes for many more pending delayed refs. What
happens is we end up not throttling the transaction enough so when we go to
commit the transaction when we've completely filled the file system we'll
abort() because we use all of the space in the global reserve and we still have
delayed refs to run. To fix this we need to make the delayed ref flushing and
the transaction throttling dependant upon the number of delayed refs that we
have instead of how much reserved space is left in the global reserve. With
this patch we not only stop aborting transactions but we also get a smoother run
speed with fs_mark and it makes us about 10% faster. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
I hit a hang when run_delayed_refs returned an error in the beginning of
btrfs_commit_transaction. If we decide we need to commit the transaction in
btrfs_end_transaction we'll set BLOCKED and start to commit, but if we get an
error this early on we'll just exit without committing. This is fine, except
that anybody else who tried to start a transaction will sit in
wait_current_trans() since we're set to BLOCKED and we never set it to something
else and woke people up. To fix this we want to check for trans->aborted
everywhere we wait for the transaction state to change, and make
btrfs_abort_transaction() wake up any waiters there may be. All the callers
will notice that the transaction has aborted and exit out properly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
I hit a deadlock because we aborted when flushing delayed refs but didn't wake
any of the other flushers up and so everybody was just sleeping forever. This
should fix the problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
Fix the code comments for lzo compression workspace.
The buf item is used to store the decompressed data
and cbuf is used to store the compressed data.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
Balance will create reloc_root for each fs root, and it's going to
record last_snapshot to filter shared blocks. The side effect of
setting last_snapshot is to break nocow attributes of files.
Since the extents are not shared by the relocation tree after the balance,
we can recover the old last_snapshot safely if no one snapshoted the
source tree. We fix the above problem by this way.
Reported-by: Kyle Gates <kylegates@hotmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
Both hole punch and truncate use ext4_ext_rm_leaf() for removing
blocks. Currently we choose the last extent as the starting
point for removing blocks:
ex = EXT_LAST_EXTENT(eh);
This is OK for truncate but for hole punch we can optimize the extent
selection as the path is already initialized. We could use this
information to select proper starting extent. The code change in this
patch will not affect truncate as for truncate path[depth].p_ext will
always be NULL.
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
If jbd2_journal_restart() fails the handle will have been disconnected
from the current transaction. In this situation, the handle must not
be used for for any jbd2 function other than jbd2_journal_stop().
Enforce this with by treating a handle which has a NULL transaction
pointer as an aborted handle, and issue a kernel warning if
jbd2_journal_extent(), jbd2_journal_get_write_access(),
jbd2_journal_dirty_metadata(), etc. is called with an invalid handle.
This commit also fixes a bug where jbd2_journal_stop() would trip over
a kernel jbd2 assertion check when trying to free an invalid handle.
Also move the responsibility of setting current->journal_info to
start_this_handle(), simplifying the three users of this function.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Younger Liu <younger.liu@huawei.com>
Cc: Jan Kara <jack@suse.cz>
|
|
Translate the bitfields used in various flags argument to strings to
make the tracepoint output more human-readable.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
The function mpage_released_unused_page() must only be called once;
otherwise the kernel will BUG() when the second call to
mpage_released_unused_page() tries to unlock the pages which had been
unlocked by the first call.
Also restructure the error handling so that we only give up on writing
the dirty pages in the case of ENOSPC where retrying the allocation
won't help. Otherwise, a transient failure, such as a kmalloc()
failure in calling ext4_map_blocks() might cause us to give up on
those pages, leading to a scary message in /var/log/messages plus data
loss.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
Once we decrement transaction->t_updates, if this is the last handle
holding the transaction from closing, and once we release the
t_handle_lock spinlock, it's possible for the transaction to commit
and be released. In practice with normal kernels, this probably won't
happen, since the commit happens in a separate kernel thread and it's
unlikely this could all happen within the space of a few CPU cycles.
On the other hand, with a real-time kernel, this could potentially
happen, so save the tid found in transaction->t_tid before we release
t_handle_lock. It would require an insane configuration, such as one
where the jbd2 thread was set to a very high real-time priority,
perhaps because a high priority real-time thread is trying to read or
write to a file system. But some people who use real-time kernels
have been known to do insane things, including controlling
laser-wielding industrial robots. :-)
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
Currently if we pass range into ext4_zero_partial_blocks() which covers
entire block we would attempt to zero it even though we should only zero
unaligned part of the block.
Fix this by checking whether the range covers the whole block skip
zeroing if so.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
The function ext4_write_inline_data_end() can return an error. So we
need to assign it to a signed integer variable to check for an error
return (since copied is an unsigned int).
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
|
|
Comparing unsigned variable with 0 always returns false.
err = 0 is duplicated and unnecessary.
[ tytso: Also cleaned up error handling in ext4_block_zero_page_range() ]
Signed-off-by: "Jon Ernst" <jonernst07@gmx.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Both ext3 and ext4 htree_dirblock_to_tree() is just filling the
in-core rbtree for use by call_filldir(). All updates of ->f_pos are
done by the latter; bumping it here (on error) is obviously wrong - we
might very well have it nowhere near the block we'd found an error in.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
Some of the functions which modify the jbd2 superblock were not
updating the checksum before calling jbd2_write_superblock(). Move
the call to jbd2_superblock_csum_set() to jbd2_write_superblock(), so
that the checksum is calculated consistently.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: stable@vger.kernel.org
|
|
No need to pass file pointer when we can directly pass inode pointer.
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
In ext4 feature inline_data,it use the xattr's space to store the
inline data in inode.When we calculate the inline data as the xattr,we
add the pad.But in get_max_inline_xattr_value_size() function we count
the free space without pad.It cause some contents are moved to a block
even if it can be
stored in the inode.
Signed-off-by: liulei <lewis.liulei@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
|
|
Reduce the object size ~10% could be useful for embedded systems.
Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
arguments, passing " " to functions when !CONFIG_PRINTK and still
verifying format and arguments with no_printk.
$ size fs/ext4/built-in.o*
text data bss dec hex filename
239375 610 888 240873 3ace9 fs/ext4/built-in.o.new
264167 738 888 265793 40e41 fs/ext4/built-in.o.old
$ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
# CONFIG_PRINTK is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT23=y
CONFIG_EXT4_FS_POSIX_ACL=y
# CONFIG_EXT4_FS_SECURITY is not set
# CONFIG_EXT4_DEBUG is not set
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Now we maintain an proper in-order LRU list in ext4 to reclaim entries
from extent status tree when we are under heavy memory pressure. For
keeping this order, a spin lock is used to protect this list. But this
lock burns a lot of CPU time. We can use the following steps to trigger
it.
% cd /dev/shm
% dd if=/dev/zero of=ext4-img bs=1M count=2k
% mkfs.ext4 ext4-img
% mount -t ext4 -o loop ext4-img /mnt
% cd /mnt
% for ((i=0;i<160;i++)); do truncate -s 64g $i; done
% for ((i=0;i<160;i++)); do cp $i /dev/null &; done
% perf record -a -g
% perf report
This commit tries to fix this problem. Now a new member called
i_touch_when is added into ext4_inode_info to record the last access
time for an inode. Meanwhile we never need to keep a proper in-order
LRU list. So this can avoid to burns some CPU time. When we try to
reclaim some entries from extent status tree, we use list_sort() to get
a proper in-order list. Then we traverse this list to discard some
entries. In ext4_sb_info, we use s_es_last_sorted to record the last
time of sorting this list. When we traverse the list, we skip the inode
that is newer than this time, and move this inode to the tail of LRU
list. When the head of the list is newer than s_es_last_sorted, we will
sort the LRU list again.
In this commit, we break the loop if s_extent_cache_cnt == 0 because
that means that all extents in extent status tree have been reclaimed.
Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
changed to save a local variable in these functions.
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
If memory allocation in ext4_mb_new_group_pa() is failed,
it returns error code, ext4_mb_new_preallocation() propages it,
but ext4_mb_new_blocks() ignores it.
An observed result was:
- allocation fail means ext4_mb_new_group_pa() does not update
ext4_allocation_context;
- ext4_mb_new_blocks() sets ext4_allocation_request->len (ar->len =
ac->ac_b_ex.fe_len;) to number of blocks preallocated (512) instead
of number of blocks requested (1);
- that activates update cycle in ext4_splice_branch():
for (i = 1; i < blks; i++) <-- blks is 512 instead of 1 here
*(where->p + i) = cpu_to_le32(current_block++);
- it iterates 511 times and corrupts a chunk of memory including inode
structure;
- page fault happens at EXT4_SB(inode->i_sb) in ext4_mark_inode_dirty();
- system hangs with 'scheduling while atomic' BUG.
The patch implements a check for ext4_mb_new_preallocation() error
code and handles its failure as if ext4_mb_regular_allocator() fails.
Found by Linux File System Verification project (linuxtesting.org).
[ Patch restructed by tytso to make the flow of control easier to follow. ]
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Subtracting the number of the first data block places the superblock
backups one block too early, corrupting the file system. When the block
size is larger than 1K, the first data block is 0, so the subtraction
has no effect and no corruption occurs.
Signed-off-by: Maarten ter Huurne <maarten@treewalker.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
CC: stable@vger.kernel.org
|
|
git://git.linaro.org/people/stevecapper/linux into upstream-hugepages
* 'for-next/hugepages' of git://git.linaro.org/people/stevecapper/linux:
ARM64: mm: THP support.
ARM64: mm: Raise MAX_ORDER for 64KB pages and THP.
ARM64: mm: HugeTLB support.
ARM64: mm: Move PTE_PROT_NONE bit.
ARM64: mm: Make PAGE_NONE pages read only and no-execute.
ARM64: mm: Restore memblock limit when map_mem finished.
mm: thp: Correct the HPAGE_PMD_ORDER check.
x86: mm: Remove general hugetlb code from x86.
mm: hugetlb: Copy general hugetlb code from x86 to mm.
x86: mm: Remove x86 version of huge_pmd_share.
mm: hugetlb: Copy huge_pmd_share from x86 to mm.
Conflicts:
arch/arm64/Kconfig
arch/arm64/include/asm/pgtable-hwdef.h
arch/arm64/include/asm/pgtable.h
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|