Age | Commit message (Collapse) | Author |
|
The newly added NFS v4.2 operations (ALLOCATE, DEALLOCATE, SEEK and CLONE)
use a helper called nfs42_set_rw_stateid to select a stateid that is sent
to the server. But they don't set the inode and state fields in the
nfs4_exception structure, and this don't partake in the stateid recovery
protocol. Because of this they will simply return errors insted of trying
to recover a stateid when the server return a BAD_STATEID error.
Additionally CLONE has the problem that it operates on two files and thus
two stateids, and thus needs to call the exception handler twice to
recover stateids.
While we're at it stop grabbing an addititional reference to the open
context in all these operations - having the file open guarantees that
the open context won't go away.
All this can be produces with the generic/168 and generic/170 tests in
xfstests which stress the CLONE stateid handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
In the case where d_add_unique() finds an appropriate alias to use it will
have already incremented the reference count. An additional dget() to swap
the open context's dentry is unnecessary and will leak a reference.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 275bb307865a3 ("NFSv4: Move dentry instantiation into the NFSv4-...")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
inode struct members that track cgroup writeback information
should be reinitialized when inode gets allocated from
kmem_cache. Otherwise, their values remain and get used by the
new inode.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Pull cifs fixes from Steve French:
"A small set of cifs fixes.
I am still reviewing some more, recently submitted SMB3 fixes, but
these three are small and safe and ready now"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix erroneous return value
cifs: fix potential overflow in cifs_compose_mount_options
cifs: remove redundant check for null string pointer
|
|
If cgroup writeback is in use, an inode is associated with a cgroup
for writeback. If the inode's main dirtier changes to another cgroup,
the association gets updated asynchronously. Nothing was pinning the
superblock while such switches are in progress and superblock could go
away while async switching is pending or in progress leading to
crashes like the following.
kernel BUG at fs/jbd2/transaction.c:319!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
Hardware name: Google Google, BIOS Google 01/01/2011
Workqueue: events inode_switch_wbs_work_fn
task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
RIP: 0010:[<ffffffff803e6922>] [<ffffffff803e6922>] start_this_handle+0x382/0x3e0
RSP: 0018:ffff880209267c30 EFLAGS: 00010202
...
Call Trace:
[<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190
[<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70
[<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0
[<ffffffff8035338b>] evict+0xbb/0x190
[<ffffffff80354190>] iput+0x130/0x190
[<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0
[<ffffffff80279819>] process_one_work+0x129/0x300
[<ffffffff80279b16>] worker_thread+0x126/0x480
[<ffffffff8027ed14>] kthread+0xc4/0xe0
[<ffffffff809771df>] ret_from_fork+0x3f/0x70
Fix it by bumping s_active while cgroup association switching is in
flight.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Tahsin Erdogan <tahsin@google.com>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Cc: stable@vger.kernel.org #v4.5+
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Add a new kernfs api is added to lookup the dentry for a particular
kernfs path.
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly setup container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking system level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The new function kernfs_path_from_node() generates and returns kernfs
path of a given kernfs_node relative to a given parent kernfs_node.
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
We can see following loop(10000 times) in trace_log:
[ 75.416137] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 75.417413] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
[ 75.418611] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 75.419793] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
[ 75.421016] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 75.422324] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
[ 75.423661] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 75.424882] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
...(10000 times)
[ 124.101672] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 124.102850] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
[ 124.104008] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 124.105121] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
Reason:
If more than one user trigger reada in same extent, the first task
finished setting of reada data struct and call reada_start_machine()
to start, and the second task only add a ref_count but have not
add reada_extctl struct completely, the reada_extent can not finished
all jobs, and will be selected in __reada_start_machine() for 10000
times(total times in __reada_start_machine()).
Fix:
For a reada_extent without job, we don't need to run it, just return
0 to let caller break.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In rechecking zone-in-tree, we still need to check zone include
our logical address.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can avoid additional locking-acquirment and one pair of
kref_get/put by combine two condition.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
reada_zone->end is end pos of segment:
end = start + cache->key.offset - 1;
So we need to use "<=" in condition to judge is a pos in the
segment.
The problem happened rearly, because logical pos rarely pointed
to last 4k of a blockgroup, but we need to fix it to make code
right in logic.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/urgent
Pull EFI fixes from Matt Fleming:
* Prevent accidental deletion of EFI variables through efivarfs that
may brick machines. We use a whitelist of known-safe variables to
allow things like installing distributions to work out of the box, and
instead restrict vendor-specific variable deletion by making
non-whitelist variables immutable (Peter Jones)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
For protection keys, we need to understand whether protections
should be enforced in software or not. In general, we enforce
protections when working on our own task, but not when on others.
We call these "current" and "remote" operations.
This patch introduces a new get_user_pages() variant:
get_user_pages_remote()
Which is a replacement for when get_user_pages() is called on
non-current tsk/mm.
We also introduce a new gup flag: FOLL_REMOTE which can be used
for the "__" gup variants to get this new behavior.
The uprobes is_trap_at_addr() location holds mmap_sem and
calls get_user_pages(current->mm) on an instruction address. This
makes it a pretty unique gup caller. Being an instruction access
and also really originating from the kernel (vs. the app), I opted
to consider this a 'remote' access where protection keys will not
be enforced.
Without protection keys, this patch should not change any behavior.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: jack@suse.cz
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Provide a stable basis for the pkeys patches, which touches various
x86 details.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When ext4_bread() fails, fname_crypto_str remains
allocated after return. Fix that.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
CC: Dmitry Monakhov <dmonakhov@virtuozzo.com>
|
|
If a bio for a direct IO request fails, we were not setting the error in
the parent bio (the main DIO bio), making us not return the error to
user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO()
return the number of bytes issued for IO and not the error a bio created
and submitted by btrfs_submit_direct() got from the block layer.
This essentially happens because when we call:
dio_end_io(dio_bio, bio->bi_error);
It does not set dio_bio->bi_error to the value of the second argument.
So just add this missing assignment in endio callbacks, just as we do in
the error path at btrfs_submit_direct() when we fail to clone the dio bio
or allocate its private object. This follows the convention of what is
done with other similar APIs such as bio_endio() where the caller is
responsible for setting the bi_error field in the bio it passes as an
argument to bio_endio().
This was detected by the new generic test cases in xfstests: 271, 272,
276 and 278. Which essentially setup a dm error target, then load the
error table, do a direct IO write and unload the error table. They
expect the write to fail with -EIO, which was not getting reported
when testing against btrfs.
Cc: stable@vger.kernel.org # 4.3+
Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
|
|
When setting the layout return mode, we must always also set the
NFS_LAYOUT_RETURN_REQUESTED flag to ensure that we send a layoutreturn.
Otherwise pnfs_error_mark_layout_for_return() could set the mode, but
fail to send the layoutreturn because another is already in flight.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
We don't need to schedule a layoutreturn if the layout segment can
be freed immediately.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Currently we can build a long ioend chain during ->writepages that
gets attached to the writepage context. IO submission only then
occurs when we finish all the writepage processing. This means we
can have many ioends allocated and pending, and this violates the
mempool guarantees that we need to give about forwards progress.
i.e. we really should only have one ioend being built at a time,
otherwise we may drain the mempool trying to allocate a new ioend
and that blocks submission, completion and freeing of ioends that
are already in progress.
To prevent this situation from happening, we need to submit ioends
for IO as soon as they are ready for dispatch rather than queuing
them for later submission. This means the ioends have bios built
immediately and they get queued on any plug that is current active.
Hence if we schedule away from writeback, the ioends that have been
built will make forwards progress due to the plug flushing on
context switch. This will also prevent context switches from
creating unnecessary IO submission latency.
We can't completely avoid having nested IO allocation - when we have
a block size smaller than a page size, we still need to hold the
ioend submission until after we have marked the current page dirty.
Hence we may need multiple ioends to be held while the current page
is completely mapped and made ready for IO dispatch. We cannot avoid
this problem - the current code already has this ioend chaining
within a page so we can mostly ignore that it occurs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Separate out the bufferhead based mapping from the writepage code so
that we have a clear separation of the page operations and the
bufferhead state.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs_cluster_write() is not necessary now that xfs_vm_writepages()
aggregates writepage calls across a single mapping. This means we no
longer need to do page lookups in xfs_cluster_write, so writeback
only needs to look up th epage cache once per page being written.
This also removes a large amount of mostly duplicate code between
xfs_do_writepage() and xfs_convert_page().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs_vm_writepages() calls generic_writepages to writeback a range of
a file, but then xfs_vm_writepage() clusters pages itself as it does
not have any context it can pass between->writepage calls from
__write_cache_pages().
Introduce a writeback context for xfs_vm_writepages() and call
__write_cache_pages directly with our own writepage callback so that
we can pass that context to each writepage invocation. This
encapsulates the current mapping, whether it is valid or not, the
current ioend and it's IO type and the ioend chain being built.
This requires us to move the ioend submission up to the level where
the writepage context is declared. This does mean we do not submit
IO until we packaged the entire writeback range, but with the block
plugging in the writepages call this is the way IO is submitted,
anyway.
It also means that we need to handle discontiguous page ranges. If
the pages sent down by write_cache_pages to the writepage callback
are discontiguous, we need to detect this and put each discontiguous
page range into individual ioends. This is needed to ensure that the
ioend accurately represents the range of the file that it covers so
that file size updates during IO completion set the size correctly.
Failure to take into account the discontiguous ranges results in
files being too small when writeback patterns are non-sequential.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
We currently have code to cancel ioends being built because we
change bufferhead state as we build the ioend. On error, this needs
to be unwound and so we have cancelling code that walks the buffers
on the ioend chain and undoes these state changes.
However, the IO submission path already handles state changes for
buffers when a submission error occurs, so we don't really need a
separate cancel function to do this - we can simply submit the
ioend chain with the specific error and it will be cancelled rather
than submitted.
Hence we can remove the explicit cancel code and just rely on
submission to deal with the error correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Remove the nonblocking optimisation done for mapping lookups during
writeback. It's not clear that leaving a hole in the writeback range
just because we couldn't get a lock is really a win, as it makes us
do another small random IO later on rather than a large sequential
IO now.
As this gets in the way of sane error handling later on, just remove
for the moment and we can re-introduce an equivalent optimisation in
future if we see problems due to extent map lock contention.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
We want the fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
We want those fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are a number of small tty and serial driver fixes for 4.5-rc4
that resolve some reported issues.
One of them got reverted as it wasn't correct based on testing, and
all have been in linux-next for a while"
* tag 'tty-4.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
Revert "8250: uniphier: allow modular build with 8250 console"
pty: make sure super_block is still valid in final /dev/tty close
pty: fix possible use after free of tty->driver_data
tty: Add support for PCIe WCH382 2S multi-IO card
serial/omap: mark wait_for_xmitr as __maybe_unused
serial: omap: Prevent DoS using unprivileged ioctl(TIOCSRS485)
8250: uniphier: allow modular build with 8250 console
tty: Drop krefs for interrupted tty lock
|
|
it's always equal to __orangefs_bufmap and the latter can't change
until we are done
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
the second caller never needs to cancel, actually
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"This has a few fixes from Filipe, along with a readdir fix from Dave
that we've been testing for some time"
* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: properly set the termination value of ctx->pos in readdir
Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl
Btrfs: remove no longer used function extent_read_full_page_nolock()
Btrfs: fix page reading in extent_same ioctl leading to csum errors
Btrfs: fix invalid page accesses in extent_same (dedup) ioctl
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs fix from Dve Chinner:
"This contains a fix for an endian conversion issue in new CRC
validation in log recovery that was discovered on a ppc64 platform"
* tag 'xfs-fixes-for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: fix endianness error when checking log block crc on big endian platforms
|
|
Introduce new mount option alias "norecovery" for nologreplay, to keep
"norecovery" behavior the same with other filesystems.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Introduce a new mount option "nologreplay" to co-operate with "ro" mount
option to get real readonly mount, like "norecovery" in ext* and xfs.
Since the new parse_options() need to check new flags at remount time,
so add a new parameter for parse_options().
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Current "recovery" mount option will only try to use backup root.
However the word "recovery" is too generic and may be confusing for some
users.
Here introduce a new and more specific mount option, "usebackuproot" to
replace "recovery" mount option.
"Recovery" will be kept for compatibility reason, but will be
deprecated.
Also, since "usebackuproot" will only affect mount behavior and after
open_ctree() it has nothing to do with the filesystem, so clear the flag
after mount succeeded.
This provides the basis for later unified "norecovery" mount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[ dropped usebackuproot from show_mount, added note about 'recovery' to
docs ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The "newblock" parameter is not used in convert_initialized_extent(),
remove it.
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
I notice ext4/307 fails occasionally on ppc64 host, reporting md5
checksum mismatch after moving data from original file to donor file.
The reason is that move_extent_per_page() calls __block_write_begin()
and block_commit_write() to write saved data from original inode blocks
to donor inode blocks, but __block_write_begin() not only maps buffer
heads but also reads block content from disk if the size is not block
size aligned. At this time the physical block number in mapped buffer
head is pointing to the donor file not the original file, and that
results in reading wrong data to page, which get written to disk in
following block_commit_write call.
This also can be reproduced by the following script on 1k block size ext4
on x86_64 host:
mnt=/mnt/ext4
donorfile=$mnt/donor
testfile=$mnt/testfile
e4compact=~/xfstests/src/e4compact
rm -f $donorfile $testfile
# reserve space for donor file, written by 0xaa and sync to disk to
# avoid EBUSY on EXT4_IOC_MOVE_EXT
xfs_io -fc "pwrite -S 0xaa 0 1m" -c "fsync" $donorfile
# create test file written by 0xbb
xfs_io -fc "pwrite -S 0xbb 0 1023" -c "fsync" $testfile
# compute initial md5sum
md5sum $testfile | tee md5sum.txt
# drop cache, force e4compact to read data from disk
echo 3 > /proc/sys/vm/drop_caches
# test defrag
echo "$testfile" | $e4compact -i -v -f $donorfile
# check md5sum
md5sum -c md5sum.txt
Fix it by creating & mapping buffer heads only but not reading blocks
from disk, because all the data in page is guaranteed to be up-to-date
in mext_page_mkuptodate().
Cc: stable@vger.kernel.org
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since sizeof(ext_new_group_data) > sizeof(ext_new_flex_group_data),
integer overflow could be happened.
Therefore, need to fix integer overflow sanitization.
Cc: stable@vger.kernel.org
Signed-off-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This patch adds a line break for proc mb_groups display.
Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
|
|
The ext4_ioctl_setflags() function which is used in the ioctls
EXT4_IOC_SETFLAGS and EXT4_IOC_FSSETXATTR may return the positive value
EPERM instead of -EPERM in case of error. This bug was introduced by a
recent commit 9b7365fc.
The following program can be used to illustrate the wrong behavior:
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <err.h>
#define FS_IOC_GETFLAGS _IOR('f', 1, long)
#define FS_IOC_SETFLAGS _IOW('f', 2, long)
#define FS_IMMUTABLE_FL 0x00000010
int main(void)
{
int fd;
long flags;
fd = open("file", O_RDWR|O_CREAT, 0600);
if (fd < 0)
err(1, "open");
if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
err(1, "ioctl: FS_IOC_GETFLAGS");
flags |= FS_IMMUTABLE_FL;
if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
err(1, "ioctl: FS_IOC_SETFLAGS");
warnx("ioctl returned no error");
return 0;
}
Running it gives the following result:
$ strace -e ioctl ./test
ioctl(3, FS_IOC_GETFLAGS, 0x7ffdbd8bfd38) = 0
ioctl(3, FS_IOC_SETFLAGS, 0x7ffdbd8bfd38) = 1
test: ioctl returned no error
+++ exited with 0 +++
Running the program on a kernel with the bug fixed gives the proper result:
$ strace -e ioctl ./test
ioctl(3, FS_IOC_GETFLAGS, 0x7ffdd2768258) = 0
ioctl(3, FS_IOC_SETFLAGS, 0x7ffdd2768258) = -1 EPERM (Operation not permitted)
test: ioctl: FS_IOC_SETFLAGS: Operation not permitted
+++ exited with 1 +++
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When block group checksum is wrong, we call ext4_error() while holding
group spinlock from ext4_init_block_bitmap() or
ext4_init_inode_bitmap() which results in scheduling while in atomic.
Fix the issue by calling ext4_error() later after dropping the spinlock.
CC: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The number of distinct key types is not that big that we could waste one
for something new we want to store in the tree.
Similar to the temporary items, we'll introduce a new name for an
existing key value and use the objectid for further extension. The
victim is the BTRFS_DEV_STATS_KEY (248).
The device stats are an example of a permanent item.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
No visible change.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The number of distinct key types is not that big that we could waste one
for something new we want to store in the tree. We'll introduce a new
name for an existing key value and use the objectid for further
extension. The victim is the BTRFS_BALANCE_ITEM_KEY (248).
The nature of the balance status item is a good example of the temporary
item. It exists from beginning of the balance, keeps the status until it
finishes.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The value of ctx->pos in the last readdir call is supposed to be set to
INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a
larger value, then it's LLONG_MAX.
There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
the increment.
We can get to that situation like that:
* emit all regular readdir entries
* still in the same call to readdir, bump the last pos to INT_MAX
* next call to readdir will not emit any entries, but will reach the
bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX
Normally this is not a problem, but if we call readdir again, we'll find
'pos' set to LLONG_MAX and the unconditional increment will overflow.
The report from Victor at
(http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
print shows that pattern:
Overflow: e
Overflow: 7fffffff
Overflow: 7fffffffffffffff
PAX: size overflow detected in function btrfs_real_readdir
fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
context: dir_context;
CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
Call Trace:
[<ffffffff81742f0f>] dump_stack+0x4c/0x7f
[<ffffffff811cb706>] report_size_overflow+0x36/0x40
[<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
[<ffffffff811dafc8>] iterate_dir+0xa8/0x150
[<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
[<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
Overflow: 1a
[<ffffffff811db070>] ? iterate_dir+0x150/0x150
[<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83
The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
are not yet synced and are processed from the delayed list. Then the code
could go to the bump section again even though it might not emit any new
dir entries from the delayed list.
The fix avoids entering the "bump" section again once we've finished
emitting the entries, both for synced and delayed entries.
References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
Reported-by: Victor <services@swwu.com>
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.com>
Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Chris Mason <clm@fb.com>
|