Age | Commit message (Collapse) | Author |
|
Merge fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: slab: free kmem_cache_node after destroy sysfs file
ipc/shm: handle removed segments gracefully in shm_mmap()
MAINTAINERS: update Kselftest Framework mailing list
devm_memremap_release(): fix memremap'd addr handling
mm/hugetlb.c: fix incorrect proc nr_hugepages value
mm, x86: fix pte_page() crash in gup_pte_range()
fsnotify: turn fsnotify reaper thread into a workqueue job
Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread"
mm: fix regression in remap_file_pages() emulation
thp, dax: do not try to withdraw pgtable from non-anon VMA
|
|
Competing overwrite DIO in dioread_nolock mode will just overwrite
pointer to io_end in the inode. This may result in data corruption or
extent conversion happening from IO completion interrupt because we
don't properly set buffer_defer_completion() when unlocked DIO races
with locked DIO to unwritten extent.
Since unlocked DIO doesn't need io_end for anything, just avoid
allocating it and corrupting pointer from inode for locked DIO.
A cleaner fix would be to avoid these games with io_end pointer from the
inode but that requires more intrusive changes so we leave that for
later.
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ext4 can update bh->b_state non-atomically in _ext4_get_block() and
ext4_da_get_block_prep(). Usually this is fine since bh is just a
temporary storage for mapping information on stack but in some cases it
can be fully living bh attached to a page. In such case non-atomic
update of bh->b_state can race with an atomic update which then gets
lost. Usually when we are mapping bh and thus updating bh->b_state
non-atomically, nobody else touches the bh and so things work out fine
but there is one case to especially worry about: ext4_finish_bio() uses
BH_Uptodate_Lock on the first bh in the page to synchronize handling of
PageWriteback state. So when blocksize < pagesize, we can be atomically
modifying bh->b_state of a buffer that actually isn't under IO and thus
can race e.g. with delalloc trying to map that buffer. The result is
that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the
corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock.
Fix the problem by always updating bh->b_state bits atomically.
CC: stable@vger.kernel.org
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We don't require a dedicated thread for fsnotify cleanup. Switch it
over to a workqueue job instead that runs on the system_unbound_wq.
In the interest of not thrashing the queued job too often when there are
a lot of marks being removed, we delay the reaper job slightly when
queueing it, to allow several to gather on the list.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit c510eff6beba ("fsnotify: destroy marks with
call_srcu instead of dedicated thread").
Eryu reported that he was seeing some OOM kills kick in when running a
testcase that adds and removes inotify marks on a file in a tight loop.
The above commit changed the code to use call_srcu to clean up the
marks. While that does (in principle) work, the srcu callback job is
limited to cleaning up entries in small batches and only once per jiffy.
It's easily possible to overwhelm that machinery with too many call_srcu
callbacks, and Eryu's reproduer did just that.
There's also another potential problem with using call_srcu here. While
you can obviously sleep while holding the srcu_read_lock, the callbacks
run under local_bh_disable, so you can't sleep there.
It's possible when putting the last reference to the fsnotify_mark that
we'll end up putting a chain of references including the fsnotify_group,
uid, and associated keys. While I don't see any obvious ways that that
could occurs, it's probably still best to avoid using call_srcu here
after all.
This patch reverts the above patch. A later patch will take a different
approach to eliminated the dedicated thread here.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reported-by: Eryu Guan <guaneryu@gmail.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Cc: Jan Kara <jack@suse.com>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When starting up linux with btrfs filesystem, I got many memory leak
messages by kmemleak as,
unreferenced object 0xffff880066882000 (size 4096):
comm "modprobe", pid 730, jiffies 4294690024 (age 196.599s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8174d52e>] kmemleak_alloc+0x4e/0xb0
[<ffffffff811d09aa>] kmem_cache_alloc_trace+0xea/0x1e0
[<ffffffffa03620fb>] btrfs_alloc_dummy_fs_info+0x6b/0x2a0 [btrfs]
[<ffffffffa03624fc>] btrfs_alloc_dummy_block_group+0x5c/0x120 [btrfs]
[<ffffffffa0360aa9>] btrfs_test_free_space_cache+0x39/0xed0 [btrfs]
[<ffffffffa03b5a74>] trace_raw_output_xfs_attr_class+0x54/0xe0 [xfs]
[<ffffffff81002122>] do_one_initcall+0xb2/0x1f0
[<ffffffff811765aa>] do_init_module+0x5e/0x1e9
[<ffffffff810fec09>] load_module+0x20a9/0x2690
[<ffffffff810ff439>] SyS_finit_module+0xb9/0xf0
[<ffffffff81757daf>] entry_SYSCALL_64_fastpath+0x12/0x76
[<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff8800573f8000 (size 10256):
comm "modprobe", pid 730, jiffies 4294690185 (age 196.460s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8174d52e>] kmemleak_alloc+0x4e/0xb0
[<ffffffff8119ca6e>] kmalloc_order+0x5e/0x70
[<ffffffff8119caa4>] kmalloc_order_trace+0x24/0x90
[<ffffffffa03620b3>] btrfs_alloc_dummy_fs_info+0x23/0x2a0 [btrfs]
[<ffffffffa03624fc>] btrfs_alloc_dummy_block_group+0x5c/0x120 [btrfs]
[<ffffffffa036603d>] run_test+0xfd/0x320 [btrfs]
[<ffffffffa0366f34>] btrfs_test_free_space_tree+0x94/0xee [btrfs]
[<ffffffffa03b5aab>] trace_raw_output_xfs_attr_class+0x8b/0xe0 [xfs]
[<ffffffff81002122>] do_one_initcall+0xb2/0x1f0
[<ffffffff811765aa>] do_init_module+0x5e/0x1e9
[<ffffffff810fec09>] load_module+0x20a9/0x2690
[<ffffffff810ff439>] SyS_finit_module+0xb9/0xf0
[<ffffffff81757daf>] entry_SYSCALL_64_fastpath+0x12/0x76
[<ffffffffffffffff>] 0xffffffffffffffff
This patch lets btrfs using fs_info stored in btrfs_root for
block group cache directly without allocating a new one.
Fixes: d0bd456074 ("Btrfs: add fragment=* debug mount option")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs failed in xfstests btrfs/080 with -o nodatacow.
Can be reproduced by following script:
DEV=/dev/vdg
MNT=/mnt/tmp
umount $DEV &>/dev/null
mkfs.btrfs -f $DEV
mount -o nodatacow $DEV $MNT
dd if=/dev/zero of=$MNT/test bs=1 count=2048 &
btrfs subvolume snapshot -r $MNT $MNT/test_snap &
wait
--
We can see dd failed on NO_SPACE.
Reason:
__btrfs_buffered_write should run cow write when no_cow impossible,
and current code is designed with above logic.
But check_can_nocow() have 2 type of return value(0 and <0) on
can_not_no_cow, and current code only continue write on first case,
the second case happened in doing subvolume.
Fix:
Continue write when check_can_nocow() return 0 and <0.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
|
|
Cleanup.
kmem_cache_destroy has support NULL argument checking,
so drop the double null testing before calling it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We were getting build warning about:
fs/btrfs/extent-tree.c:7021:34: warning: ‘used_bg’ may be used
uninitialized in this function
It is not a valid warning as used_bg is never used uninitilized since
locked is initially false so we can never be in the section where
'used_bg' is used. But gcc is not able to understand that and we can
initialize it while declaring to silence the warning.
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We use the private member of extent_state to store the failrec and play
pointless pointer games.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_fs_time() instead.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The kernel provides a swap() that does the same thing as this code.
Signed-off-by: Dave Jones <dsj@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
While running btrfs_mksubvol(), d_really_is_positive() is called twice.
First in btrfs_mksubvol() and second inside btrfs_may_create(). So I
remove the first one.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Simplify expression in btrfs_calc_trans_metadata_size().
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We will sometimes start background flushing the various enospc related things
(delayed nodes, delalloc, etc) if we are getting close to reserving all of our
available space. We don't want to do this however when we are actually using
this space as it causes unneeded thrashing. We currently try to do this by
checking bytes_used >= thresh, but bytes_used is only part of the equation, we
need to use bytes_reserved as well as this represents space that is very likely
to become bytes_used in the future.
My tracing tool will keep count of the number of times we kick off the async
flusher, the following are counts for the entire run of generic/027
No Patch Patch
avg: 5385 5009
median: 5500 4916
We skewed lower than the average with my patch and higher than the average with
the patch, overall it cuts the flushing from anywhere from 5-10%, which in the
case of actual ENOSPC is quite helpful. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are a few places where we add to trans->bytes_reserved but don't have the
corresponding trace point. With these added my tool no longer sees transaction
leaks.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
truncate_space_check is using btrfs_csum_bytes_to_leaves() but forgetting to
multiply by nodesize so we get an actual byte count. We need a tracepoint here
so that we have the matching reserve for the release that will come later. Also
add a comment to make clear what the intent of truncate_space_check is.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
I'm writing a tool to visualize the enospc system in order to help debug enospc
bugs and I found weird data and ran it down to when we update the global block
rsv. We add all of the remaining free space to the block rsv, do a trace event,
then remove the extra and do another trace event. This makes my visualization
look silly and is unintuitive code as well. Fix this stuff to only add the
amount we are missing, or free the amount we are missing. This is less clean to
read but more explicit in what it is doing, as well as only emitting events for
values that make sense. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For a non-existent device, old code bypasses adding it in dev's reada
queue.
And to solve problem of unfinished waitting in raid5/6,
commit 5fbc7c59fd22 ("Btrfs: fix unfinished readahead thread for
raid5/6 degraded mounting")
adding an exception for the first stripe, in short, the first
stripe will always be processed whether the device exists or not.
Actually we have a better way for the above request: just bypass
creation of the reada_extent for non-existent device, it will make
code simple and effective.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Reada background works is not designed to finish all jobs
completely, it will break in following case:
1: When a device reaches workload limit (MAX_IN_FLIGHT)
2: Total reads reach max limit (10000)
3: All devices don't have queued more jobs, often happened in DUP case
And if all background works exit with remaining jobs,
btrfs_reada_wait() will wait indefinetelly.
Above problem is rarely happened in old code, because:
1: Every work queues 2x new works
So many works reduced chances of undone jobs.
2: One work will continue 10000 times loop in case of no-jobs
It reduced no-thread window time.
But after we fixed above case, the "undone reada extents" frequently
happened.
Fix:
Check to ensure we have at least one thread if there are undone jobs
in btrfs_reada_wait().
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Reada creates 2 works for each level of tree recursively.
In case of a tree having many levels, the number of created works
is 2^level_of_tree.
Actually we don't need so many works in parallel, this patch limits
max works to BTRFS_MAX_MIRRORS * 2.
The per-fs works_counter will be also used for btrfs_reada_wait() to
check is there are background workers.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
No need to decrease dev->reada_in_flight in __readahead_hook()'s
internal and reada_extent_put().
reada_extent_put() have no chance to decrease dev->reada_in_flight
in free operation, because reada_extent have additional refcnt when
scheduled to a dev.
We can put inc and dec operation for dev->reada_in_flight to one
place instead to make logic simple and safe, and move useless
reada_extent->scheduled_for to a bool flag instead.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Remove one copy of loop to fix the typo of iterate zones.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Current code set nritems to 0 to make for_loop useless to bypass it,
and set generation's value which is not necessary.
Jump into cleanup directly is better choise.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
What __readahead_hook() need exactly is fs_info, no need to convert
fs_info to root in caller and convert back in __readahead_hook()
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
reada_start_machine_dev() already have reada_extent pointer, pass
it into __readahead_hook() directly instead of search radix_tree
will make code run faster.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can't release reada_extent earlier than __readahead_hook(), because
__readahead_hook() still need to use it, it is necessary to hode a refcnt
to avoid it be freed.
Actually it is not a problem after my patch named:
Avoid many times of empty loop
It make reada_extent in above line include at least one reada_extctl,
which keeps additional one refcnt for reada_extent.
But we still need this patch to make the code in pretty logic.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
level is not used in severial functions, remove them from arguments,
and remove relative code for get its value.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When failed adding all dev_zones for a reada_extent, the extent
will have no chance to be selected to run, and keep in memory
for ever.
We should bypass this extent to avoid above case.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If some device is not reachable, we should bypass and continus addingb
next, instead of break on bad device.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move is_need_to_readahead contition earlier to avoid useless loop
to get relative data for readahead.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Pull block fixes from Jens Axboe:
"A collection of fixes from the past few weeks that should go into 4.5.
This contains:
- Overflow fix for sysfs discard show function from Alan.
- A stacking limit init fix for max_dev_sectors, so we don't end up
artificially capping some use cases. From Keith.
- Have blk-mq proper end unstarted requests on a dying queue, instead
of pushing that to the driver. From Keith.
- NVMe:
- Update to Kconfig description for NVME_SCSI, since it was
vague and having it on is important for some SUSE distros.
From Christoph.
- Set of fixes from Keith, around surprise removal. Also kills
the no-merge flag, so it supports merging.
- Set of fixes for lightnvm from Matias, Javier, and Wenwei.
- Fix null_blk oops when asked for lightnvm, but not available. From
Matias.
- Copy-to-user EINTR fix from Hannes, fixing a case where SG_IO fails
if interrupted by a signal.
- Two floppy fixes from Jiri, fixing signal handling and blocking
open.
- A use-after-free fix for O_DIRECT, from Mike Krinkin.
- A block module ref count fix from Roman Pen.
- An fs IO wait accounting fix for O_DSYNC from Stephane Gasparini.
- Smaller reallo fix for xen-blkfront from Bob Liu.
- Removal of an unused struct member in the deadline IO scheduler,
from Tahsin.
- Also from Tahsin, properly initialize inode struct members
associated with cgroup writeback, if enabled.
- From Tejun, ensure that we keep the superblock pinned during cgroup
writeback"
* 'for-linus' of git://git.kernel.dk/linux-block: (25 commits)
blk: fix overflow in queue_discard_max_hw_show
writeback: initialize inode members that track writeback history
writeback: keep superblock pinned during cgroup writeback association switches
bio: return EINTR if copying to user space got interrupted
NVMe: Rate limit nvme IO warnings
NVMe: Poll device while still active during remove
NVMe: Requeue requests on suspended queues
NVMe: Allow request merges
NVMe: Fix io incapable return values
blk-mq: End unstarted requests on dying queue
block: Initialize max_dev_sectors to 0
null_blk: oops when initializing without lightnvm
block: fix module reference leak on put_disk() call for cgroups throttle
nvme: fix Kconfig description for BLK_DEV_NVME_SCSI
kernel/fs: fix I/O wait not accounted for RW O_DSYNC
floppy: refactor open() flags handling
lightnvm: allow to force mm initialization
lightnvm: check overflow and correct mlc pairs
lightnvm: fix request intersection locking in rrpc
lightnvm: warn if irqs are disabled in lock laddr
...
|
|
unreferenced object 0xffffc90000abf000 (size 16900):
comm "fsync02", pid 15765, jiffies 4297431627 (age 423.772s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 a0 c2 19 00 88 ff ff ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8174d54e>] kmemleak_alloc+0x4e/0xb0
[<ffffffff811b9b91>] __vmalloc_node_range+0x231/0x280
[<ffffffff811b9c2a>] __vmalloc+0x4a/0x50
[<ffffffffa02c9ec1>] ext_tree_prepare_commit+0x231/0x2e0 [blocklayoutdriver]
[<ffffffffa02c700e>] bl_prepare_layoutcommit+0xe/0x10 [blocklayoutdriver]
[<ffffffffa0596a6c>] pnfs_layoutcommit_inode+0x29c/0x330 [nfsv4]
[<ffffffffa0596b13>] pnfs_generic_sync+0x13/0x20 [nfsv4]
[<ffffffffa0585188>] nfs4_file_fsync+0x58/0x150 [nfsv4]
[<ffffffff81228e5b>] vfs_fsync_range+0x4b/0xb0
[<ffffffff81228f1d>] do_fsync+0x3d/0x70
[<ffffffff812291d0>] SyS_fsync+0x10/0x20
[<ffffffff81757def>] entry_SYSCALL_64_fastpath+0x12/0x76
[<ffffffffffffffff>] 0xffffffffffffffff
v2, add missing include header
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
The newly added NFS v4.2 operations (ALLOCATE, DEALLOCATE, SEEK and CLONE)
use a helper called nfs42_set_rw_stateid to select a stateid that is sent
to the server. But they don't set the inode and state fields in the
nfs4_exception structure, and this don't partake in the stateid recovery
protocol. Because of this they will simply return errors insted of trying
to recover a stateid when the server return a BAD_STATEID error.
Additionally CLONE has the problem that it operates on two files and thus
two stateids, and thus needs to call the exception handler twice to
recover stateids.
While we're at it stop grabbing an addititional reference to the open
context in all these operations - having the file open guarantees that
the open context won't go away.
All this can be produces with the generic/168 and generic/170 tests in
xfstests which stress the CLONE stateid handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
In the case where d_add_unique() finds an appropriate alias to use it will
have already incremented the reference count. An additional dget() to swap
the open context's dentry is unnecessary and will leak a reference.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 275bb307865a3 ("NFSv4: Move dentry instantiation into the NFSv4-...")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
inode struct members that track cgroup writeback information
should be reinitialized when inode gets allocated from
kmem_cache. Otherwise, their values remain and get used by the
new inode.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Pull cifs fixes from Steve French:
"A small set of cifs fixes.
I am still reviewing some more, recently submitted SMB3 fixes, but
these three are small and safe and ready now"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix erroneous return value
cifs: fix potential overflow in cifs_compose_mount_options
cifs: remove redundant check for null string pointer
|
|
If cgroup writeback is in use, an inode is associated with a cgroup
for writeback. If the inode's main dirtier changes to another cgroup,
the association gets updated asynchronously. Nothing was pinning the
superblock while such switches are in progress and superblock could go
away while async switching is pending or in progress leading to
crashes like the following.
kernel BUG at fs/jbd2/transaction.c:319!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
Hardware name: Google Google, BIOS Google 01/01/2011
Workqueue: events inode_switch_wbs_work_fn
task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
RIP: 0010:[<ffffffff803e6922>] [<ffffffff803e6922>] start_this_handle+0x382/0x3e0
RSP: 0018:ffff880209267c30 EFLAGS: 00010202
...
Call Trace:
[<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190
[<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70
[<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0
[<ffffffff8035338b>] evict+0xbb/0x190
[<ffffffff80354190>] iput+0x130/0x190
[<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0
[<ffffffff80279819>] process_one_work+0x129/0x300
[<ffffffff80279b16>] worker_thread+0x126/0x480
[<ffffffff8027ed14>] kthread+0xc4/0xe0
[<ffffffff809771df>] ret_from_fork+0x3f/0x70
Fix it by bumping s_active while cgroup association switching is in
flight.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Tahsin Erdogan <tahsin@google.com>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Cc: stable@vger.kernel.org #v4.5+
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
We can see following loop(10000 times) in trace_log:
[ 75.416137] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 75.417413] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
[ 75.418611] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 75.419793] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
[ 75.421016] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 75.422324] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
[ 75.423661] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 75.424882] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
...(10000 times)
[ 124.101672] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 124.102850] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
[ 124.104008] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2
[ 124.105121] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1
Reason:
If more than one user trigger reada in same extent, the first task
finished setting of reada data struct and call reada_start_machine()
to start, and the second task only add a ref_count but have not
add reada_extctl struct completely, the reada_extent can not finished
all jobs, and will be selected in __reada_start_machine() for 10000
times(total times in __reada_start_machine()).
Fix:
For a reada_extent without job, we don't need to run it, just return
0 to let caller break.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In rechecking zone-in-tree, we still need to check zone include
our logical address.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can avoid additional locking-acquirment and one pair of
kref_get/put by combine two condition.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
reada_zone->end is end pos of segment:
end = start + cache->key.offset - 1;
So we need to use "<=" in condition to judge is a pos in the
segment.
The problem happened rearly, because logical pos rarely pointed
to last 4k of a blockgroup, but we need to fix it to make code
right in logic.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/urgent
Pull EFI fixes from Matt Fleming:
* Prevent accidental deletion of EFI variables through efivarfs that
may brick machines. We use a whitelist of known-safe variables to
allow things like installing distributions to work out of the box, and
instead restrict vendor-specific variable deletion by making
non-whitelist variables immutable (Peter Jones)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When ext4_bread() fails, fname_crypto_str remains
allocated after return. Fix that.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
CC: Dmitry Monakhov <dmonakhov@virtuozzo.com>
|
|
If a bio for a direct IO request fails, we were not setting the error in
the parent bio (the main DIO bio), making us not return the error to
user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO()
return the number of bytes issued for IO and not the error a bio created
and submitted by btrfs_submit_direct() got from the block layer.
This essentially happens because when we call:
dio_end_io(dio_bio, bio->bi_error);
It does not set dio_bio->bi_error to the value of the second argument.
So just add this missing assignment in endio callbacks, just as we do in
the error path at btrfs_submit_direct() when we fail to clone the dio bio
or allocate its private object. This follows the convention of what is
done with other similar APIs such as bio_endio() where the caller is
responsible for setting the bi_error field in the bio it passes as an
argument to bio_endio().
This was detected by the new generic test cases in xfstests: 271, 272,
276 and 278. Which essentially setup a dm error target, then load the
error table, do a direct IO write and unload the error table. They
expect the write to fail with -EIO, which was not getting reported
when testing against btrfs.
Cc: stable@vger.kernel.org # 4.3+
Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
|
|
When setting the layout return mode, we must always also set the
NFS_LAYOUT_RETURN_REQUESTED flag to ensure that we send a layoutreturn.
Otherwise pnfs_error_mark_layout_for_return() could set the mode, but
fail to send the layoutreturn because another is already in flight.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
We don't need to schedule a layoutreturn if the layout segment can
be freed immediately.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are a number of small tty and serial driver fixes for 4.5-rc4
that resolve some reported issues.
One of them got reverted as it wasn't correct based on testing, and
all have been in linux-next for a while"
* tag 'tty-4.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
Revert "8250: uniphier: allow modular build with 8250 console"
pty: make sure super_block is still valid in final /dev/tty close
pty: fix possible use after free of tty->driver_data
tty: Add support for PCIe WCH382 2S multi-IO card
serial/omap: mark wait_for_xmitr as __maybe_unused
serial: omap: Prevent DoS using unprivileged ioctl(TIOCSRS485)
8250: uniphier: allow modular build with 8250 console
tty: Drop krefs for interrupted tty lock
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"This has a few fixes from Filipe, along with a readdir fix from Dave
that we've been testing for some time"
* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: properly set the termination value of ctx->pos in readdir
Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl
Btrfs: remove no longer used function extent_read_full_page_nolock()
Btrfs: fix page reading in extent_same ioctl leading to csum errors
Btrfs: fix invalid page accesses in extent_same (dedup) ioctl
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs fix from Dve Chinner:
"This contains a fix for an endian conversion issue in new CRC
validation in log recovery that was discovered on a ppc64 platform"
* tag 'xfs-fixes-for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: fix endianness error when checking log block crc on big endian platforms
|