summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2021-02-11io_uring: move submit side state closer in the ringJens Axboe
We recently added the submit side req cache, but it was placed at the end of the struct. Move it near the other submission state for better memory placement, and reshuffle a few other members at the same time. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-11fs/jfs: fix potential integer overflow on shift of a intColin Ian King
The left shift of int 32 bit integer constant 1 is evaluated using 32 bit arithmetic and then assigned to a signed 64 bit integer. In the case where l2nb is 32 or more this can lead to an overflow. Avoid this by shifting the value 1LL instead. Addresses-Coverity: ("Uninitentional integer overflow") Fixes: b40c2e665cd5 ("fs/jfs: TRIM support for JFS Filesystem") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2021-02-11cifs: Set CIFS_MOUNT_USE_PREFIX_PATH flag on setting cifs_sb->prepath.Shyam Prasad N
While debugging another issue today, Steve and I noticed that if a subdir for a file share is already mounted on the client, any new mount of any other subdir (or the file share root) of the same share results in sharing the cifs superblock, which e.g. can result in incorrect device name. While setting prefix path for the root of a cifs_sb, CIFS_MOUNT_USE_PREFIX_PATH flag should also be set. Without it, prepath is not even considered in some places, and output of "mount" and various /proc/<>/*mount* related options can be missing part of the device name. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-11cifs: In the new mount api we get the full devname as source=Ronnie Sahlberg
so we no longer need to handle or parse the UNC= and prefixpath= options that mount.cifs are generating. This also fixes a bug in the mount command option where the devname would be truncated into just //server/share because we were looking at the truncated UNC value and not the full path. I.e. in the mount command output the devive //server/share/path would show up as just //server/share Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-11xfs: consider shutdown in bmapbt cursor delete assertBrian Foster
The assert in xfs_btree_del_cursor() checks that the bmapbt block allocation field has been handled correctly before the cursor is freed. This field is used for accurate calculation of indirect block reservation requirements (for delayed allocations), for example. generic/019 reproduces a scenario where this assert fails because the filesystem has shutdown while in the middle of a bmbt record insertion. This occurs after a bmbt block has been allocated via the cursor but before the higher level bmap function (i.e. xfs_bmap_add_extent_hole_real()) completes and resets the field. Update the assert to accommodate the transient state if the filesystem has shutdown. While here, clean up the indentation and comments in the function. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-11io_uring: assign file_slot prior to calling io_sqe_file_register()Jens Axboe
We use the assigned slot in io_sqe_file_register(), and a previous patch moved the assignment to after we have called it. This isn't super pretty, and will get cleaned up in the future. For now, fix the regression by restoring the previous assignment/clear of the file_slot. Fixes: ea64ec02b31d ("io_uring: deduplicate file table slot calculation") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-11erofs: initialized fields can only be observed after bit is setGao Xiang
Currently, although set_bit() & test_bit() pairs are used as a fast- path for initialized configurations. However, these atomic ops are actually relaxed forms. Instead, load-acquire & store-release form is needed to make sure uninitialized fields won't be observed in advance here (yet no such corresponding bitops so use full barriers instead.) Link: https://lore.kernel.org/r/20210209130618.15838-1-hsiangkao@aol.com Fixes: 62dc45979f3f ("staging: erofs: fix race of initializing xattrs of a inode at the same time") Fixes: 152a333a5895 ("staging: erofs: add compacted compression indexes support") Cc: <stable@vger.kernel.org> # 5.3+ Reported-by: Huang Jianan <huangjianan@oppo.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-02-11erofs: fix shift-out-of-bounds of blkszbitsGao Xiang
syzbot generated a crafted bitszbits which can be shifted out-of-bounds[1]. So directly print unsupported blkszbits instead of blksize. [1] https://lore.kernel.org/r/000000000000c72ddd05b9444d2f@google.com Link: https://lore.kernel.org/r/20210120013016.14071-1-hsiangkao@aol.com Reported-by: syzbot+c68f467cd7c45860e8d4@syzkaller.appspotmail.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-02-10xfs: fix boolreturn.cocci warningskernel test robot
fs/xfs/xfs_log.c:1062:9-10: WARNING: return of 0/1 in function 'xfs_log_need_covered' with return type bool Return statements in functions returning bool should use true/false instead of 1/0. Generated by: scripts/coccinelle/misc/boolreturn.cocci Fixes: 37444fc4cc39 ("xfs: lift writable fs check up into log worker task") CC: Brian Foster <bfoster@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: kernel test robot <lkp@intel.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-10xfs: restore shutdown check in mapped write fault pathBrian Foster
XFS triggers an iomap warning in the write fault path due to a !PageUptodate() page if a write fault happens to occur on a page that recently failed writeback. The iomap writeback error handling code can clear the Uptodate flag if no portion of the page is submitted for I/O. This is reproduced by fstest generic/019, which combines various forms of I/O with simulated disk failures that inevitably lead to filesystem shutdown (which then unconditionally fails page writeback). This is a regression introduced by commit f150b4234397 ("xfs: split the iomap ops for buffered vs direct writes") due to the removal of a shutdown check and explicit error return in the ->iomap_begin() path used by the write fault path. The explicit error return historically translated to a SIGBUS, but now carries on with iomap processing where it complains about the unexpected state. Restore the shutdown check to xfs_buffered_write_iomap_begin() to restore historical behavior. Fixes: f150b4234397 ("xfs: split the iomap ops for buffered vs direct writes") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-10io_uring: remove redundant initialization of variable retColin Ian King
The variable ret is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Fixes: b63534c41e20 ("io_uring: re-issue block requests that failed because of resources") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: unpark SQPOLL thread for cancelationPavel Begunkov
We park SQPOLL task before going into io_uring_cancel_files(), so the task won't run task_works including those that might be important for the cancellation passes. In this case it's io_poll_remove_one(), which frees requests via io_put_req_deferred(). Unpark it for while waiting, it's ok as we disable submissions beforehand, so no new requests will be generated. INFO: task syz-executor893:8493 blocked for more than 143 seconds. Call Trace: context_switch kernel/sched/core.c:4327 [inline] __schedule+0x90c/0x21a0 kernel/sched/core.c:5078 schedule+0xcf/0x270 kernel/sched/core.c:5157 io_uring_cancel_files fs/io_uring.c:8912 [inline] io_uring_cancel_task_requests+0xe70/0x11a0 fs/io_uring.c:8979 __io_uring_files_cancel+0x110/0x1b0 fs/io_uring.c:9067 io_uring_files_cancel include/linux/io_uring.h:51 [inline] do_exit+0x2fe/0x2ae0 kernel/exit.c:780 do_group_exit+0x125/0x310 kernel/exit.c:922 __do_sys_exit_group kernel/exit.c:933 [inline] __se_sys_exit_group kernel/exit.c:931 [inline] __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Cc: stable@vger.kernel.org # 5.5+ Reported-by: syzbot+695b03d82fa8e4901b06@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10Revert "io_uring: don't take fs for recvmsg/sendmsg"Jens Axboe
This reverts commit 10cad2c40dcb04bb46b2bf399e00ca5ea93d36b0. Petr reports that with this commit in place, io_uring fails the chroot test (CVE-202-29373). We do need to retain ->fs for send/recvmsg, so revert this commit. Reported-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10nilfs2: make splice write available againJoachim Henke
Since 5.10, splice() or sendfile() to NILFS2 return EINVAL. This was caused by commit 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops"). This patch initializes the splice_write field in file_operations, like most file systems do, to restore the functionality. Link: https://lkml.kernel.org/r/1612784101-14353-1-git-send-email-konishi.ryusuke@gmail.com Signed-off-by: Joachim Henke <joachim.henke@t-systems.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> [5.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-10zonefs: use zone write granularity as block sizeDamien Le Moal
Zoned block devices have different granularity constraints for write operations into sequential zones. E.g. ZBC and ZAC devices require that writes be aligned to the device physical block size while NVMe ZNS devices allow logical block size aligned write operations. To correctly handle such difference, use the device zone write granularity limit to set the block size of a zonefs volume, thus allowing the smallest possible write unit for all zoned device types. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: place ring SQ/CQ arrays under memcg memory limitsJens Axboe
Instead of imposing rlimit memlock limits for the rings themselves, ensure that we account them properly under memcg with __GFP_ACCOUNT. We retain rlimit memlock for registered buffers, this is just for the ring arrays themselves. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: enable kmemcg account for io_uring requestsJens Axboe
This puts io_uring under the memory cgroups accounting and limits for requests. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: enable req cache for IRQ driven IOJens Axboe
This is the last class of requests that cannot utilize the req alloc cache. Add a per-ctx req cache that is protected by the completion_lock, and refill our submit side cache when it gets over our batch count. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: fix possible deadlock in io_uring_pollHao Xu
Abaci reported follow issue: [ 30.615891] ====================================================== [ 30.616648] WARNING: possible circular locking dependency detected [ 30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted [ 30.618035] ------------------------------------------------------ [ 30.618914] a.out/1128 is trying to acquire lock: [ 30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220 [ 30.620505] [ 30.620505] but task is already holding lock: [ 30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0 [ 30.622349] [ 30.622349] which lock already depends on the new lock. [ 30.622349] [ 30.623289] [ 30.623289] the existing dependency chain (in reverse order) is: [ 30.624243] [ 30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}: [ 30.625263] lock_acquire+0x2c7/0x390 [ 30.625868] __mutex_lock+0xae/0x9f0 [ 30.626451] io_cqring_overflow_flush.part.95+0x6d/0x70 [ 30.627278] io_uring_poll+0xcb/0xd0 [ 30.627890] ep_item_poll.isra.14+0x4e/0x90 [ 30.628531] do_epoll_ctl+0xb7e/0x1120 [ 30.629122] __x64_sys_epoll_ctl+0x70/0xb0 [ 30.629770] do_syscall_64+0x2d/0x40 [ 30.630332] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 30.631187] [ 30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}: [ 30.631985] check_prevs_add+0x226/0xb00 [ 30.632584] __lock_acquire+0x1237/0x13a0 [ 30.633207] lock_acquire+0x2c7/0x390 [ 30.633740] __mutex_lock+0xae/0x9f0 [ 30.634258] __ep_eventpoll_poll+0x9f/0x220 [ 30.634879] __io_arm_poll_handler+0xbf/0x220 [ 30.635462] io_issue_sqe+0xa6b/0x13e0 [ 30.635982] __io_queue_sqe+0x10b/0x550 [ 30.636648] io_queue_sqe+0x235/0x470 [ 30.637281] io_submit_sqes+0xcce/0xf10 [ 30.637839] __x64_sys_io_uring_enter+0x3fb/0x5b0 [ 30.638465] do_syscall_64+0x2d/0x40 [ 30.638999] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 30.639643] [ 30.639643] other info that might help us debug this: [ 30.639643] [ 30.640618] Possible unsafe locking scenario: [ 30.640618] [ 30.641402] CPU0 CPU1 [ 30.641938] ---- ---- [ 30.642664] lock(&ctx->uring_lock); [ 30.643425] lock(&ep->mtx); [ 30.644498] lock(&ctx->uring_lock); [ 30.645668] lock(&ep->mtx); [ 30.646321] [ 30.646321] *** DEADLOCK *** [ 30.646321] [ 30.647642] 1 lock held by a.out/1128: [ 30.648424] #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0 [ 30.649954] [ 30.649954] stack backtrace: [ 30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1 [ 30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 30.652290] Call Trace: [ 30.652688] dump_stack+0xac/0xe3 [ 30.653164] check_noncircular+0x11e/0x130 [ 30.653747] ? check_prevs_add+0x226/0xb00 [ 30.654303] check_prevs_add+0x226/0xb00 [ 30.654845] ? add_lock_to_list.constprop.49+0xac/0x1d0 [ 30.655564] __lock_acquire+0x1237/0x13a0 [ 30.656262] lock_acquire+0x2c7/0x390 [ 30.656788] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.657379] ? __io_queue_proc.isra.88+0x180/0x180 [ 30.658014] __mutex_lock+0xae/0x9f0 [ 30.658524] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.659112] ? mark_held_locks+0x5a/0x80 [ 30.659648] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.660229] ? _raw_spin_unlock_irqrestore+0x2d/0x40 [ 30.660885] ? trace_hardirqs_on+0x46/0x110 [ 30.661471] ? __io_queue_proc.isra.88+0x180/0x180 [ 30.662102] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.662696] __ep_eventpoll_poll+0x9f/0x220 [ 30.663273] ? __ep_eventpoll_poll+0x220/0x220 [ 30.663875] __io_arm_poll_handler+0xbf/0x220 [ 30.664463] io_issue_sqe+0xa6b/0x13e0 [ 30.664984] ? __lock_acquire+0x782/0x13a0 [ 30.665544] ? __io_queue_proc.isra.88+0x180/0x180 [ 30.666170] ? __io_queue_sqe+0x10b/0x550 [ 30.666725] __io_queue_sqe+0x10b/0x550 [ 30.667252] ? __fget_files+0x131/0x260 [ 30.667791] ? io_req_prep+0xd8/0x1090 [ 30.668316] ? io_queue_sqe+0x235/0x470 [ 30.668868] io_queue_sqe+0x235/0x470 [ 30.669398] io_submit_sqes+0xcce/0xf10 [ 30.669931] ? xa_load+0xe4/0x1c0 [ 30.670425] __x64_sys_io_uring_enter+0x3fb/0x5b0 [ 30.671051] ? lockdep_hardirqs_on_prepare+0xde/0x180 [ 30.671719] ? syscall_enter_from_user_mode+0x2b/0x80 [ 30.672380] do_syscall_64+0x2d/0x40 [ 30.672901] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 30.673503] RIP: 0033:0x7fd89c813239 [ 30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48 [ 30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa [ 30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239 [ 30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003 [ 30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000 [ 30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0 [ 30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000 This might happen if we do epoll_wait on a uring fd while reading/writing the former epoll fd in a sqe in the former uring instance. So let's don't flush cqring overflow list, just do a simple check. Reported-by: Abaci <abaci@linux.alibaba.com> Fixes: 6c503150ae33 ("io_uring: patch up IOPOLL overflow_flush sync") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: defer flushing cached reqsPavel Begunkov
Awhile there are requests in the allocation cache -- use them, only if those ended go for the stashed memory in comp.free_list. As list manipulation are generally heavy and are not good for caches, flush them all or as much as can in one go. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: return success/failure from io_flush_cached_reqs()] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: take comp_state from ctxPavel Begunkov
__io_queue_sqe() is always called with a non-NULL comp_state, which is taken directly from context. Don't pass it around but infer from ctx. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: enable req cache for task_work itemsJens Axboe
task_work is run without utilizing the req alloc cache, so any deferred items don't get to take advantage of either the alloc or free side of it. With task_work now being wrapped by io_uring, we can use the ctx completion state to both use the req cache and the completion flush batching. With this, the only request type that cannot take advantage of the req cache is IRQ driven IO for regular files / block devices. Anything else, including IOPOLL polled IO to those same tyes, will take advantage of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: provide FIFO ordering for task_workJens Axboe
task_work is a LIFO list, due to how it's implemented as a lockless list. For long chains of task_work, this can be problematic as the first entry added is the last one processed. Similarly, we'd waste a lot of CPU cycles reversing this list. Wrap the task_work so we have a single task_work entry per task per ctx, and use that to run it in the right order. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: use persistent request cacheJens Axboe
Now that we have the submit_state in the ring itself, we can have io_kiocb allocations that are persistent across invocations. This reduces the time spent doing slab allocations and frees. [sil: rebased] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: feed reqs back into alloc cachePavel Begunkov
Make io_req_free_batch(), which is used for inline executed requests and IOPOLL, to return requests back into the allocation cache, so avoid most of kmalloc()/kfree() for those cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: persistent req cachePavel Begunkov
Don't free batch-allocated requests across syscalls. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: count ctx refs separately from reqsPavel Begunkov
Currently batch free handles request memory freeing and ctx ref putting together. Separate them and use different counters, that will be needed for reusing reqs memory. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: remove fallback_reqPavel Begunkov
Remove fallback_req for now, it gets in the way of other changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: submit-completion free batchingPavel Begunkov
io_submit_flush_completions() does completion batching, but may also use free batching as iopoll does. The main beneficiaries should be buffered reads/writes and send/recv. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: replace list with array for compl batchPavel Begunkov
Reincarnation of an old patch that replaces a list in struct io_compl_batch with an array. It's needed to avoid hooking requests via their compl.list, because it won't be always available in the future. It's also nice to split io_submit_flush_completions() to avoid free under locks and remove unlock/lock with a long comment describing when it can be done. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: don't reinit submit state every timePavel Begunkov
As now submit_state is retained across syscalls, we can save ourself from initialising it from ground up for each io_submit_sqes(). Set some fields during ctx allocation, and just keep them always consistent. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: remove unnecessary zeroing of ctx members] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: remove ctx from comp_statePavel Begunkov
completion state is closely bound to ctx, we don't need to store ctx inside as we always have it around to pass to flush. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: don't keep submit_state on stackPavel Begunkov
struct io_submit_state is quite big (168 bytes) and going to grow. It's better to not keep it on stack as it is now. Move it to context, it's always protected by uring_lock, so it's fine to have only one instance of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: don't propagate io_comp_statePavel Begunkov
There is no reason to drag io_comp_state into opcode handlers, we just need a flag and the actual work will be done in __io_queue_sqe(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10gfs2: Recursive gfs2_quota_hold in gfs2_iomap_endAndreas Gruenbacher
When starting an iomap write, gfs2_quota_lock_check -> gfs2_quota_lock -> gfs2_quota_hold is called from gfs2_iomap_begin. At the end of the write, before unlocking the quotas, punch_hole -> gfs2_quota_hold can be called again in gfs2_iomap_end, which is incorrect and leads to a failed assertion. Instead, move the call to gfs2_quota_unlock before the call to punch_hole to fix that. Fixes: 64bc06bb32ee ("gfs2: iomap buffered write support") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-09cifs: do not disable noperm if multiuser mount option is not providedRonnie Sahlberg
Fixes small regression in implementation of new mount API. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reported-by: Hyunchul Lee <hyc.lee@gmail.com> Tested-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-09io_uring: make op handlers always take issue flagsPavel Begunkov
Make opcode handler interfaces a bit more consistent by always passing in issue flags. Bulky but pretty easy and mechanical change. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-09io_uring: replace force_nonblock with flagsPavel Begunkov
Replace bool force_nonblock with flags. It has a long standing goal of differentiating context from which we execute. Currently we have some subtle places where some invariants, like holding of uring_lock, are subtly inferred. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-09tmpfs: disallow CONFIG_TMPFS_INODE64 on alphaSeth Forshee
As with s390, alpha is a 64-bit architecture with a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode numbers and display "inode64" in the mount options, whereas passing "inode64" in the mount options will fail. This leads to erroneous behaviours such as this: # mkdir mnt # mount -t tmpfs nodev mnt # mount -o remount,rw mnt mount: /home/ubuntu/mnt: mount point not mounted or bad option. Prevent CONFIG_TMPFS_INODE64 from being selected on alpha. Link: https://lkml.kernel.org/r/20210208215726.608197-1-seth.forshee@canonical.com Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb") Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Chris Down <chris@chrisdown.name> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: <stable@vger.kernel.org> [5.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-09tmpfs: disallow CONFIG_TMPFS_INODE64 on s390Seth Forshee
Currently there is an assumption in tmpfs that 64-bit architectures also have a 64-bit ino_t. This is not true on s390 which has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode numbers and display "inode64" in the mount options, but passing the "inode64" mount option will fail. This leads to the following behavior: # mkdir mnt # mount -t tmpfs nodev mnt # mount -o remount,rw mnt mount: /home/ubuntu/mnt: mount point not mounted or bad option. As mount sees "inode64" in the mount options and thus passes it in the options for the remount. So prevent CONFIG_TMPFS_INODE64 from being selected on s390. Link: https://lkml.kernel.org/r/20210205230620.518245-1-seth.forshee@canonical.com Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb") Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Chris Down <chris@chrisdown.name> Cc: Hugh Dickins <hughd@google.com> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: <stable@vger.kernel.org> [5.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-09squashfs: add more sanity checks in xattr id lookupPhillip Lougher
Sysbot has reported a warning where a kmalloc() attempt exceeds the maximum limit. This has been identified as corruption of the xattr_ids count when reading the xattr id lookup table. This patch adds a number of additional sanity checks to detect this corruption and others. 1. It checks for a corrupted xattr index read from the inode. This could be because the metadata block is uncompressed, or because the "compression" bit has been corrupted (turning a compressed block into an uncompressed block). This would cause an out of bounds read. 2. It checks against corruption of the xattr_ids count. This can either lead to the above kmalloc failure, or a smaller than expected table to be read. 3. It checks the contents of the index table for corruption. [phillip@squashfs.org.uk: fix checkpatch issue] Link: https://lkml.kernel.org/r/270245655.754655.1612770082682@webmail.123-reg.co.uk Link: https://lkml.kernel.org/r/20210204130249.4495-5-phillip@squashfs.org.uk Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Reported-by: syzbot+2ccea6339d368360800d@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-09squashfs: add more sanity checks in inode lookupPhillip Lougher
Sysbot has reported an "slab-out-of-bounds read" error which has been identified as being caused by a corrupted "ino_num" value read from the inode. This could be because the metadata block is uncompressed, or because the "compression" bit has been corrupted (turning a compressed block into an uncompressed block). This patch adds additional sanity checks to detect this, and the following corruption. 1. It checks against corruption of the inodes count. This can either lead to a larger table to be read, or a smaller than expected table to be read. In the case of a too large inodes count, this would often have been trapped by the existing sanity checks, but this patch introduces a more exact check, which can identify too small values. 2. It checks the contents of the index table for corruption. [phillip@squashfs.org.uk: fix checkpatch issue] Link: https://lkml.kernel.org/r/527909353.754618.1612769948607@webmail.123-reg.co.uk Link: https://lkml.kernel.org/r/20210204130249.4495-4-phillip@squashfs.org.uk Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Reported-by: syzbot+04419e3ff19d2970ea28@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-09squashfs: add more sanity checks in id lookupPhillip Lougher
Sysbot has reported a number of "slab-out-of-bounds reads" and "use-after-free read" errors which has been identified as being caused by a corrupted index value read from the inode. This could be because the metadata block is uncompressed, or because the "compression" bit has been corrupted (turning a compressed block into an uncompressed block). This patch adds additional sanity checks to detect this, and the following corruption. 1. It checks against corruption of the ids count. This can either lead to a larger table to be read, or a smaller than expected table to be read. In the case of a too large ids count, this would often have been trapped by the existing sanity checks, but this patch introduces a more exact check, which can identify too small values. 2. It checks the contents of the index table for corruption. Link: https://lkml.kernel.org/r/20210204130249.4495-3-phillip@squashfs.org.uk Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Reported-by: syzbot+b06d57ba83f604522af2@syzkaller.appspotmail.com Reported-by: syzbot+c021ba012da41ee9807c@syzkaller.appspotmail.com Reported-by: syzbot+5024636e8b5fd19f0f19@syzkaller.appspotmail.com Reported-by: syzbot+bcbc661df46657d0fa4f@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-09squashfs: avoid out of bounds writes in decompressorsPhillip Lougher
Patch series "Squashfs: fix BIO migration regression and add sanity checks". Patch [1/4] fixes a regression introduced by the "migrate from ll_rw_block usage to BIO" patch, which has produced a number of Sysbot/Syzkaller reports. Patches [2/4], [3/4], and [4/4] fix a number of filesystem corruption issues which have produced Sysbot reports in the id, inode and xattr lookup code. Each patch has been tested against the Sysbot reproducers using the given kernel configuration. They have the appropriate "Reported-by:" lines added. Additionally, all of the reproducer filesystems are indirectly fixed by patch [4/4] due to the fact they all have xattr corruption which is now detected there. Additional testing with other configurations and architectures (32bit, big endian), and normal filesystems has also been done to trap any inadvertent regressions caused by the additional sanity checks. This patch (of 4): This is a regression introduced by the patch "migrate from ll_rw_block usage to BIO". Sysbot/Syskaller has reported a number of "out of bounds writes" and "unable to handle kernel paging request in squashfs_decompress" errors which have been identified as a regression introduced by the above patch. Specifically, the patch removed the following sanity check if (length < 0 || length > output->length || (index + length) > msblk->bytes_used) This check did two things: 1. It ensured any reads were not beyond the end of the filesystem 2. It ensured that the "length" field read from the filesystem was within the expected maximum length. Without this any corrupted values can over-run allocated buffers. Link: https://lkml.kernel.org/r/20210204130249.4495-1-phillip@squashfs.org.uk Link: https://lkml.kernel.org/r/20210204130249.4495-2-phillip@squashfs.org.uk Fixes: 93e72b3c612adc ("squashfs: migrate from ll_rw_block usage to BIO") Reported-by: syzbot+6fba78f99b9afd4b5634@syzkaller.appspotmail.com Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Cc: Philippe Liard <pliard@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-09cifs: fix dfs-linksRonnie Sahlberg
This fixes a regression following dfs links that was introduced in the patch series for the new mount api. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-09fs/affs: release old buffer head on error pathPan Bian
The reference count of the old buffer head should be decremented on path that fails to get the new buffer head. Fixes: 6b4657667ba0 ("fs/affs: add rename exchange") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09mm: provide a saner PTE walking API for modulesPaolo Bonzini
Currently, the follow_pfn function is exported for modules but follow_pte is not. However, follow_pfn is very easy to misuse, because it does not provide protections (so most of its callers assume the page is writable!) and because it returns after having already unlocked the page table lock. Provide instead a simplified version of follow_pte that does not have the pmdpp and range arguments. The older version survives as follow_invalidate_pte() for use by fs/dax.c. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09btrfs: zoned: enable to mount ZONED incompat flagNaohiro Aota
This final patch adds the ZONED incompat flag to the supported flags and enables to mount ZONED flagged file system. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: deal with holes writing out tree-log pagesNaohiro Aota
Since the zoned filesystem requires sequential write out of metadata, we cannot proceed with a hole in tree-log pages. When such a hole exists, btree_write_cache_pages() will return -EAGAIN. This happens when someone, e.g., a concurrent transaction commit, writes a dirty extent in this tree-log commit. If we are not going to wait for the extents, we can hope the concurrent writing fills the hole for us. So, we can ignore the error in this case and hope the next write will succeed. If we want to wait for them and got the error, we cannot wait for them because it will cause a deadlock. So, let's bail out to a full commit in this case. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: reorder log node allocation on zoned filesystemNaohiro Aota
This is the 3/3 patch to enable tree-log on zoned filesystems. The allocation order of nodes of "fs_info->log_root_tree" and nodes of "root->log_root" is not the same as the writing order of them. So, the writing causes unaligned write errors. Reorder the allocation of them by delaying allocation of the root node of "fs_info->log_root_tree," so that the node buffers can go out sequentially to devices. Cc: Filipe Manana <fdmanana@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>