Age | Commit message (Collapse) | Author |
|
We might have cases where the need for a specific timeout is gone, add
support for canceling an existing timeout operation. This works like the
POLL_REMOVE command, where the application passes in the user_data of
the timeout it wishes to cancel in the sqe->addr field.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is a pretty trivial addition on top of the relative timeouts
we have now, but it's handy for ensuring tighter timing for those
that are building scheduling primitives on top of io_uring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is no function change, just to clean up the code, use s->in_async
to make the code know where it is.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We currently size the CQ ring as twice the SQ ring, to allow some
flexibility in not overflowing the CQ ring. This is done because the
SQE life time is different than that of the IO request itself, the SQE
is consumed as soon as the kernel has seen the entry.
Certain application don't need a huge SQ ring size, since they just
submit IO in batches. But they may have a lot of requests pending, and
hence need a big CQ ring to hold them all. By allowing the application
to control the CQ ring size multiplier, we can cater to those
applications more efficiently.
If an application wants to define its own CQ ring size, it must set
IORING_SETUP_CQSIZE in the setup flags, and fill out
io_uring_params->cq_entries. The value must be a power of two.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Allows the application to remove/replace/add files to/from a file set.
Passes in a struct:
struct io_uring_files_update {
__u32 offset;
__s32 *fds;
};
that holds an array of fds, size of array passed in through the usual
nr_args part of the io_uring_register() system call. The logic is as
follows:
1) If ->fds[i] is -1, the existing file at i + ->offset is removed from
the set.
2) If ->fds[i] is a valid fd, the existing file at i + ->offset is
replaced with ->fds[i].
For case #2, is the existing file is currently empty (fd == -1), the
new fd is simply added to the array.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is in preparation for allowing updates to fixed file sets without
requiring a full unregister+register.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently any dependent link is executed from a new workqueue context,
which means that we'll be doing a context switch per link in the chain.
If we are running the completion of the current request from our async
workqueue and find that the next request is a link, then run it directly
from the workqueue context instead of forcing another switch.
This improves the performance of linked SQEs, and reduces the CPU
overhead.
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
I need to pick up the independent changes made to
Documentation/core-api/memory-allocation.rst to be able to merge further
work without creating a total mess.
|
|
Now that we have a vtime safe kcpustat accessor for CPUTIME_SYSTEM, use
it to start fixing frozen kcpustat values on nohz_full CPUs.
Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Link: https://lkml.kernel.org/r/20191016025700.31277-13-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Remove xfs_zero_file_space and reorganize xfs_file_fallocate so that a
single call to xfs_alloc_file_space covers all modes that preallocate
blocks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
If we always have to write out of place preallocating blocks is
pointless. We already check for this in the normal falloc path, but
the check was missig in the legacy ALLOCSP path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
These use the same scheme as the pre-existing mapping of the XFS
RESVP ioctls to ->falloc, so just extend it and remove the XFS
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick: fix compile error on s390]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
These ioctls are implemented by the VFS and mapped to ->fallocate now,
so this code won't ever be reached.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Add a new xfs_inode_buftarg helper that gets the data I/O buftarg for a
given inode. Replace the existing xfs_find_bdev_for_inode and
xfs_find_daxdev_for_inode helpers with this new general one and cleanup
some of the callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
syzkaller reported an issue where it looks like a malicious app can
trigger a use-after-free of reading the ctx ->sq_array and ->rings
value right after having installed the ring fd in the process file
table.
Defer ring fd installation until after we're done reading those
values.
Fixes: 75b28affdd6a ("io_uring: allocate the two rings together")
Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_queue_link_head() owns shadow_req after taking it as an argument.
By not freeing it in case of an error, it can leak the request along
with taken ctx->refs.
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Spotted by Ronnie.
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
We want the sysfs fix in here as well to build on top of.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Pull cifs fixes from Steve French:
"Seven cifs/smb3 fixes, including three for stable"
* tag '5.4-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix cifsInodeInfo lock_sem deadlock when reconnect occurs
CIFS: Fix use after free of file info structures
CIFS: Fix retry mid list corruption on reconnects
cifs: Fix missed free operations
CIFS: avoid using MID 0xFFFF
cifs: clarify comment about timestamp granularity for old servers
cifs: Handle -EINPROGRESS only when noblockcnt is set
|
|
Pull block and io_uring fixes from Jens Axboe:
"A bit bigger than usual at this point in time, mostly due to some good
bug hunting work by Pavel that resulted in three io_uring fixes from
him and two from me. Anyway, this pull request contains:
- Revert of the submit-and-wait optimization for io_uring, it can't
always be done safely. It depends on commands always making
progress on their own, which isn't necessarily the case outside of
strict file IO. (me)
- Series of two patches from me and three from Pavel, fixing issues
with shared data and sequencing for io_uring.
- Lastly, two timeout sequence fixes for io_uring (zhangyi)
- Two nbd patches fixing races (Josef)
- libahci regulator_get_optional() fix (Mark)"
* tag 'for-linus-2019-10-26' of git://git.kernel.dk/linux-block:
nbd: verify socket is supported during setup
ata: libahci_platform: Fix regulator_get_optional() misuse
nbd: handle racing with error'ed out commands
nbd: protect cmd->status with cmd->lock
io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
io_uring: used cached copies of sq->dropped and cq->overflow
io_uring: Fix race for sqes with userspace
io_uring: Fix broken links with offloading
io_uring: Fix corrupted user_data
io_uring: correct timeout req sequence when inserting a new entry
io_uring : correct timeout req sequence when waiting timeout
io_uring: revert "io_uring: optimize submit_and_wait API"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax fix from Dan Williams:
"Fix a performance regression that followed from a fix to the
conversion of the fsdax implementation to the xarray. v5.3 users
report that they stop seeing huge page mappings on an application +
filesystem layout that was seeing huge pages previously on v5.2"
* tag 'dax-fix-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
fs/dax: Fix pmd vs pte conflict detection
|
|
According to commit message in the original commit c75b1d9421f8 ("fs:
add fcntl() interface for setting/getting write life time hints"),
as well as userspace library[1] and man page update[2], R/W hint constants
are intended to have RWH_* prefix. However, RWF_WRITE_LIFE_NOT_SET retained
"RWF_*" prefix used in the early versions of the proposed patch set[3].
Rename it and provide the old name as a synonym for the new one
for backward compatibility.
[1] https://github.com/axboe/fio/commit/bd553af6c849
[2] https://github.com/mkerrisk/man-pages/commit/580082a186fd
[3] https://www.mail-archive.com/linux-block@vger.kernel.org/msg09638.html
Fixes: c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life time hints")
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
no users outside of fs/dcache.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
When a task that is allocating metadata needs to wait for the async
reclaim job to process its ticket and gets a signal (because it was killed
for example) before doing the wait, the task ends up erroring out but
with space reserved for its ticket, which never gets released, resulting
in a metadata space leak (more specifically a leak in the bytes_may_use
counter of the metadata space_info object).
Here's the sequence of steps leading to the space leak:
1) A task tries to create a file for example, so it ends up trying to
start a transaction at btrfs_create();
2) The filesystem is currently in a state where there is not enough
metadata free space to satisfy the transaction's needs. So at
space-info.c:__reserve_metadata_bytes() we create a ticket and
add it to the list of tickets of the space info object. Also,
because the metadata async reclaim job is not running, we queue
a job ro run metadata reclaim;
3) In the meanwhile the task receives a signal (like SIGTERM from
a kill command for example);
4) After queing the async reclaim job, at __reserve_metadata_bytes(),
we unlock the metadata space info and call handle_reserve_ticket();
5) That last function calls wait_reserve_ticket(), which acquires the
lock from the metadata space info. Then in the first iteration of
its while loop, it calls prepare_to_wait_event(), which returns
-ERESTARTSYS because the task has a pending signal. As a result,
we set the error field of the ticket to -EINTR and exit the while
loop without deleting the ticket from the list of tickets (in the
space info object). After exiting the loop we unlock the space info;
6) The async reclaim job is able to release enough metadata, acquires
the metadata space info's lock and then reserves space for the ticket,
since the ticket is still in the list of (non-priority) tickets. The
space reservation happens at btrfs_try_granting_tickets(), called from
maybe_fail_all_tickets(). This increments the bytes_may_use counter
from the metadata space info object, sets the ticket's bytes field to
zero (meaning success, that space was reserved) and removes it from
the list of tickets;
7) wait_reserve_ticket() returns, with the error field of the ticket
set to -EINTR. Then handle_reserve_ticket() just propagates that error
to the caller. Because an error was returned, the caller does not
release the reserved space, since the expectation is that any error
means no space was reserved.
Fix this by removing the ticket from the list, while holding the space
info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
an error.
Also add some comments and an assertion to guarantee we never end up with
a ticket that has an error set and a bytes counter field set to zero, to
more easily detect regressions in the future.
This issue could be triggered sporadically by some test cases from fstests
such as generic/269 for example, which tries to fill a filesystem and then
kills fsstress processes running in the background.
When this issue happens, we get a warning in syslog/dmesg when unmounting
the filesystem, like the following:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
CPU: 0 PID: 13240 Comm: umount Tainted: G W L 5.3.0-rc8-btrfs-next-48+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
FS: 00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x1ad/0x390 [btrfs]
generic_shutdown_super+0x6c/0x110
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0xa0 [btrfs]
deactivate_locked_super+0x3a/0x70
cleanup_mnt+0xb4/0x160
task_work_run+0x7e/0xc0
exit_to_usermode_loop+0xfa/0x100
do_syscall_64+0x1cb/0x220
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f4274d2cb37
(...)
RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace bcf4b235461b26f6 ]---
BTRFS info (device sdb): space_info 4 has 19116032 free, is full
BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0
Fixes: 374bf9c5cd7d0b ("btrfs: unify error handling for ticket flushing")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
The following script will cause false alert on devid check.
#!/bin/bash
dev1=/dev/test/test
dev2=/dev/test/scratch1
mnt=/mnt/btrfs
umount $dev1 &> /dev/null
umount $dev2 &> /dev/null
umount $mnt &> /dev/null
mkfs.btrfs -f $dev1
mount $dev1 $mnt
_fail()
{
echo "!!! FAILED !!!"
exit 1
}
for ((i = 0; i < 4096; i++)); do
btrfs dev add -f $dev2 $mnt || _fail
btrfs dev del $dev1 $mnt || _fail
dev_tmp=$dev1
dev1=$dev2
dev2=$dev_tmp
done
[CAUSE]
Tree-checker uses BTRFS_MAX_DEVS() and BTRFS_MAX_DEVS_SYS_CHUNK() as
upper limit for devid. But we can have devid holes just like above
script.
So the check for devid is incorrect and could cause false alert.
[FIX]
Just remove the whole devid check. We don't have any hard requirement
for devid assignment.
Furthermore, even devid could get corrupted by a bitflip, we still have
dev extents verification at mount time, so corrupted data won't sneak
in.
This fixes fstests btrfs/194.
Reported-by: Anand Jain <anand.jain@oracle.com>
Fixes: ab4ba2e13346 ("btrfs: tree-checker: Verify dev item")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For SYSTEM chunks, despite the regular chunk item size limit, there is
another limit due to system chunk array size.
The extra limit was removed in a refactoring, so add it back.
Fixes: e3ecdb3fdecf ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk")
CC: stable@vger.kernel.org # 5.3+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We currently assume that submissions from the sqthread are successful,
and if IO polling is enabled, we use that value for knowing how many
completions to look for. But if we overflowed the CQ ring or some
requests simply got errored and already completed, they won't be
available for polling.
For the case of IO polling and SQTHREAD usage, look at the pending
poll list. If it ever hits empty then we know that we don't have
anymore pollable requests inflight. For that case, simply reset
the inflight count to zero.
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We currently use the ring values directly, but that can lead to issues
if the application is malicious and changes these values on our behalf.
Created in-kernel cached versions of them, and just overwrite the user
side when we update them. This is similar to how we treat the sq/cq
ring tail/head updates.
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In commit 8648de2c581e ("f2fs: add bio cache for IPU"), we added
f2fs_submit_ipu_bio() in __write_data_page() as below:
__write_data_page()
if (!S_ISDIR(inode->i_mode) && !IS_NOQUOTA(inode)) {
f2fs_submit_ipu_bio(sbi, bio, page);
....
}
in order to avoid below deadlock:
Thread A Thread B
- __write_data_page (inode x, page y)
- f2fs_do_write_data_page
- set_page_writeback ---- set writeback flag in page y
- f2fs_inplace_write_data
- f2fs_balance_fs
- lock gc_mutex
- lock gc_mutex
- f2fs_gc
- do_garbage_collect
- gc_data_segment
- move_data_page
- f2fs_wait_on_page_writeback
- wait_on_page_writeback --- wait writeback of page y
However, the bio submission breaks the merge of IPU IOs.
So in this patch let's add a global bio cache for merged IPU pages,
then f2fs_wait_on_page_writeback() is able to submit bio if a
writebacked page is cached in global bio cache.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
io_ring_submit() finalises with
1. io_commit_sqring(), which releases sqes to the userspace
2. Then calls to io_queue_link_head(), accessing released head's sqe
Reorder them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_sq_thread() processes sqes by 8 without considering links. As a
result, links will be randomely subdivided.
The easiest way to fix it is to call io_get_sqring() inside
io_submit_sqes() as do io_ring_submit().
Downsides:
1. This removes optimisation of not grabbing mm_struct for fixed files
2. It submitting all sqes in one go, without finer-grained sheduling
with cq processing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is a bug, where failed linked requests are returned not with
specified @user_data, but with garbage from a kernel stack.
The reason is that io_fail_links() uses req->user_data, which is
uninitialised when called from io_queue_sqe() on fail path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
if the second call of should_expire() in there ends up
grabbing and returning a new reference to dentry, we need
to drop it before continuing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
There's a deadlock that is possible and can easily be seen with
a test where multiple readers open/read/close of the same file
and a disruption occurs causing reconnect. The deadlock is due
a reader thread inside cifs_strict_readv calling down_read and
obtaining lock_sem, and then after reconnect inside
cifs_reopen_file calling down_read a second time. If in
between the two down_read calls, a down_write comes from
another process, deadlock occurs.
CPU0 CPU1
---- ----
cifs_strict_readv()
down_read(&cifsi->lock_sem);
_cifsFileInfo_put
OR
cifs_new_fileinfo
down_write(&cifsi->lock_sem);
cifs_reopen_file()
down_read(&cifsi->lock_sem);
Fix the above by changing all down_write(lock_sem) calls to
down_write_trylock(lock_sem)/msleep() loop, which in turn
makes the second down_read call benign since it will never
block behind the writer while holding lock_sem.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Suggested-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed--by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
Currently the code assumes that if a file info entry belongs
to lists of open file handles of an inode and a tcon then
it has non-zero reference. The recent changes broke that
assumption when putting the last reference of the file info.
There may be a situation when a file is being deleted but
nothing prevents another thread to reference it again
and start using it. This happens because we do not hold
the inode list lock while checking the number of references
of the file info structure. Fix this by doing the proper
locking when doing the check.
Fixes: 487317c99477d ("cifs: add spinlock for the openFileList to cifsInodeInfo")
Fixes: cb248819d209d ("cifs: use cifsInodeInfo->open_file_lock while iterating to avoid a panic")
Cc: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When the client hits reconnect it iterates over the mid
pending queue marking entries for retry and moving them
to a temporary list to issue callbacks later without holding
GlobalMid_Lock. In the same time there is no guarantee that
mids can't be removed from the temporary list or even
freed completely by another thread. It may cause a temporary
list corruption:
[ 430.454897] list_del corruption. prev->next should be ffff98d3a8f316c0, but was 2e885cb266355469
[ 430.464668] ------------[ cut here ]------------
[ 430.466569] kernel BUG at lib/list_debug.c:51!
[ 430.468476] invalid opcode: 0000 [#1] SMP PTI
[ 430.470286] CPU: 0 PID: 13267 Comm: cifsd Kdump: loaded Not tainted 5.4.0-rc3+ #19
[ 430.473472] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 430.475872] RIP: 0010:__list_del_entry_valid.cold+0x31/0x55
...
[ 430.510426] Call Trace:
[ 430.511500] cifs_reconnect+0x25e/0x610 [cifs]
[ 430.513350] cifs_readv_from_socket+0x220/0x250 [cifs]
[ 430.515464] cifs_read_from_socket+0x4a/0x70 [cifs]
[ 430.517452] ? try_to_wake_up+0x212/0x650
[ 430.519122] ? cifs_small_buf_get+0x16/0x30 [cifs]
[ 430.521086] ? allocate_buffers+0x66/0x120 [cifs]
[ 430.523019] cifs_demultiplex_thread+0xdc/0xc30 [cifs]
[ 430.525116] kthread+0xfb/0x130
[ 430.526421] ? cifs_handle_standard+0x190/0x190 [cifs]
[ 430.528514] ? kthread_park+0x90/0x90
[ 430.530019] ret_from_fork+0x35/0x40
Fix this by obtaining extra references for mids being retried
and marking them as MID_DELETED which indicates that such a mid
has been dequeued from the pending list.
Also move mid cleanup logic from DeleteMidQEntry to
_cifs_mid_q_entry_release which is called when the last reference
to a particular mid is put. This allows to avoid any use-after-free
of response buffers.
The patch needs to be backported to stable kernels. A stable tag
is not mentioned below because the patch doesn't apply cleanly
to any actively maintained stable kernel.
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-and-tested-by: David Wysochanski <dwysocha@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fix from Andreas Gruenbacher:
"Fix a memory leak introduced in -rc1"
* tag 'gfs2-v5.4-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix memory leak when gfs2meta's fs_context is freed
|
|
Flags passed to Q_XQUOTARM were not sanity checked for invalid values.
Fix that.
Fixes: 9da93f9b7cdf ("xfs: fix Q_XQUOTARM ioctl")
Reported-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
gfs2 and gfs2meta share an ->init_fs_context function which allocates an
args structure stored in fc->fs_private. gfs2 registers a ->free
function to free this memory when the fs_context is cleaned up, but
there was not one registered for gfs2meta, causing a leak.
Register a ->free function for gfs2meta. The existing gfs2_fc_free
function does what we need.
Reported-by: syzbot+c2fdfd2b783754878fb6@syzkaller.appspotmail.com
Fixes: 1f52aa08d12f ("gfs2: Convert gfs2 to fs_context")
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Currently, we do not check memory allocation
result for ei->i_block_alloc_info in ioctl,
this patch checks it and returns error in
failure case.
Link: https://lore.kernel.org/r/20191023135643.28837-1-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
The sequence number of the timeout req (req->sequence) indicate the
expected completion request. Because of each timeout req consume a
sequence number, so the sequence of each timeout req on the timeout
list shouldn't be the same. But now, we may get the same number (also
incorrect) if we insert a new entry before the last one, such as submit
such two timeout reqs on a new ring instance below.
req->sequence
req_1 (count = 2): 2
req_2 (count = 1): 2
Then, if we submit a nop req, req_2 will still timeout even the nop req
finished. This patch fix this problem by adjust the sequence number of
each reordered reqs when inserting a new entry.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The sequence number of reqs on the timeout_list before the timeout req
should be adjusted in io_timeout_fn(), because the current timeout req
will consumes a slot in the cq_ring and cq_tail pointer will be
increased, otherwise other timeout reqs may return in advance without
waiting for enough wait_nr.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There are cases where it isn't always safe to block for submission,
even if the caller asked to wait for events as well. Revert the
previous optimization of doing that.
This reverts two commits:
bf7ec93c644cb
c576666863b78
Fixes: c576666863b78 ("io_uring: optimize submit_and_wait API")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The xfs_pnfs.c file is missing an include of xfs_pnfs.h to
add the prototypes of the functions it exports. Include this
file to fix the following sparse warnings:
fs/xfs/xfs_pnfs.c:27:1: warning: symbol 'xfs_break_leased_layouts' was not declared. Should it be static?
fs/xfs/xfs_pnfs.c:52:1: warning: symbol 'xfs_fs_get_uuid' was not declared. Should it be static?
fs/xfs/xfs_pnfs.c:77:1: warning: symbol 'xfs_fs_map_blocks' was not declared. Should it be static?
fs/xfs/xfs_pnfs.c:226:1: warning: symbol 'xfs_fs_commit_blocks' was not declared. Should it be static?
Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
xfs_bmapi_write() takes a total block requirement parameter that is
passed down to the block allocation code and is used to specify the
total block requirement of the associated transaction. This is used
to try and select an AG that can not only satisfy the requested
extent allocation, but can also accommodate subsequent allocations
that might be required to complete the transaction. For example,
additional bmbt block allocations may be required on insertion of
the resulting extent to an inode data fork.
While it's important for callers to calculate and reserve such extra
blocks in the transaction, it is not necessary to pass the total
value to xfs_bmapi_write() in all cases. The latter automatically
sets minleft to ensure that sufficient free blocks remain after the
allocation attempt to expand the format of the associated inode
(i.e., such as extent to btree conversion, btree splits, etc).
Therefore, any callers that pass a total block requirement of the
bmap mapping length plus worst case bmbt expansion essentially
specify the additional reservation requirement twice. These callers
can pass a total of zero to rely on the bmapi minleft policy.
Beyond being superfluous, the primary motivation for this change is
that the total reservation logic in the bmbt code is dubious in
scenarios where minlen < maxlen and a maxlen extent cannot be
allocated (which is more common for data extent allocations where
contiguity is not required). The total value is based on maxlen in
the xfs_bmapi_write() caller. If the bmbt code falls back to an
allocation between minlen and maxlen, that allocation will not
succeed until total is reset to minlen, which essentially throws
away any additional reservation included in total by the caller. In
addition, the total value is not reset until after alignment is
dropped, which means that such callers drop alignment far too
aggressively than necessary.
Update all callers of xfs_bmapi_write() that pass a total block
value of the mapping length plus bmbt reservation to instead pass
zero and rely on xfs_bmapi_minleft() to enforce the bmbt reservation
requirement. This trades off slightly less conservative AG selection
for the ability to preserve alignment in more scenarios.
xfs_bmapi_write() callers that incorporate unrelated or additional
reservations in total beyond what is already included in minleft
must continue to use the former.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Cap longest extent to the largest we can allocate based on limits
calculated at mount time. Dynamic state (such as finobt blocks)
can result in the longest free extent exceeding the size we can
allocate, and that results in failure to align full AG allocations
when the AG is empty.
Result:
xfs_io-4413 [003] 426.412459: xfs_alloc_vextent_loopfailed: dev 8:96 agno 0 agbno 32 minlen 243968 maxlen 244000 mod 0 prod 1 minleft 1 total 262148 alignment 32 minalignslop 0 len 0 type NEAR_BNO otype START_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x5 firstblock 0xffffffffffffffff
minlen and maxlen are now separated by the alignment size, and
allocation fails because args.total > free space in the AG.
[bfoster: Added xfs_bmap_btalloc() changes.]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
KUnit tests for decoding extended 64 bit timestamps that verify the
seconds part of [a/c/m] timestamps in ext4 inode structs are decoded
correctly.
Test data is derived from the table in the Inode Timestamps section of
Documentation/filesystems/ext4/inodes.rst.
KUnit tests run during boot and output the results to the debug log
in TAP format (http://testanything.org/). Only useful for kernel devs
running KUnit test harness and are not for inclusion into a production
build.
Signed-off-by: Iurii Zaikin <yzaikin@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
Tested-by: Brendan Higgins <brendanhiggins@google.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
|