path: root/fs
Age  Commit message  Author

2020-03-14  io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN}  (Pavel Begunkov)

Processing links, io_submit_sqe() prepares requests, drops sqes, and passes them with sqe=NULL to io_queue_sqe(). There, IOSQE_DRAIN and/or IOSQE_ASYNC requests will go through the same prep, which doesn't expect sqe=NULL and fails with a NULL pointer dereference.

Always do the full prepare, including io_alloc_async_ctx(), for linked requests, so the second preparation can be skipped.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
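A hedged reproducer sketch of the submission shape that hit this path - a nop link whose non-head request carries IOSQE_ASYNC. This is an illustration, not the official regression test:

    /* build with: gcc repro.c -o repro -luring */
    #include <liburing.h>

    int main(void)
    {
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            int i;

            if (io_uring_queue_init(8, &ring, 0) < 0)
                    return 1;

            sqe = io_uring_get_sqe(&ring);          /* head of the link */
            io_uring_prep_nop(sqe);
            sqe->flags |= IOSQE_IO_LINK;

            sqe = io_uring_get_sqe(&ring);          /* linked request, forced async */
            io_uring_prep_nop(sqe);
            sqe->flags |= IOSQE_ASYNC;

            io_uring_submit(&ring);
            for (i = 0; i < 2; i++) {
                    io_uring_wait_cqe(&ring, &cqe);
                    io_uring_cqe_seen(&ring, cqe);
            }
            io_uring_queue_exit(&ring);
            return 0;
    }
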
2020-03-13  afs: Fix client call Rx-phase signal handling  (David Howells)

Fix the handling of signals in client rxrpc calls made by the afs filesystem. Ignore signals completely, leaving call abandonment or connection loss to be detected by timeouts inside AF_RXRPC.

Allowing a filesystem call to be interrupted after the entire request has been transmitted and an abort sent means that the server may or may not have done the action - and we don't know. It may even be worse than that for older servers.

Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13  afs: Fix handling of an abort from a service handler  (David Howells)

When an AFS service handler function aborts a call, AF_RXRPC marks the call as complete - which means that it's not going to get any more packets from the receiver. This is a problem because reception of the final ACK is what triggers afs_deliver_to_call() to drop the final ref on the afs_call object.

Instead, aborted AFS service calls may then just sit around waiting for ever or until they're displaced by a new call on the same connection channel or a connection-level abort.

Fix this by calling afs_set_call_complete() to finalise the afs_call struct representing the call. However, we then need to drop the ref that stops the call from being deallocated. We can do this in afs_set_call_complete(), as the work queue is holding a separate ref of its own, but then we shouldn't do it in afs_process_async_call() and afs_delete_async_call(). call->drop_ref is set to indicate that a ref needs dropping for a call and this is dealt with when we transition a call to AFS_CALL_COMPLETE.

But then we also need to get rid of the ref that pins an asynchronous client call. We can do this by the same mechanism, setting call->drop_ref for an async client call too.

We can also get rid of call->incoming since nothing ever sets it and only one thing ever checks it (futilely).

A trace of the rxrpc_call and afs_call struct ref counting looks like:

    <idle>-0         [001] ..s5 164.764892: rxrpc_call: c=00000002 SEE u=3 sp=rxrpc_new_incoming_call+0x473/0xb34 a=00000000442095b5
    <idle>-0         [001] .Ns5 164.766001: rxrpc_call: c=00000002 QUE u=4 sp=rxrpc_propose_ACK+0xbe/0x551 a=00000000442095b5
    <idle>-0         [001] .Ns4 164.766005: rxrpc_call: c=00000002 PUT u=3 sp=rxrpc_new_incoming_call+0xa3f/0xb34 a=00000000442095b5
    <idle>-0         [001] .Ns7 164.766433: afs_call: c=00000002 WAKE u=2 o=11 sp=rxrpc_notify_socket+0x196/0x33c
    kworker/1:2-1810 [001] ...1 164.768409: rxrpc_call: c=00000002 SEE u=3 sp=rxrpc_process_call+0x25/0x7ae a=00000000442095b5
    kworker/1:2-1810 [001] ...1 164.769439: rxrpc_tx_packet: c=00000002 e9f1a7a8:95786a88:00000008:09c5 00000001 00000000 02 22 ACK CallAck
    kworker/1:2-1810 [001] ...1 164.769459: rxrpc_call: c=00000002 PUT u=2 sp=rxrpc_process_call+0x74f/0x7ae a=00000000442095b5
    kworker/1:2-1810 [001] ...1 164.770794: afs_call: c=00000002 QUEUE u=3 o=12 sp=afs_deliver_to_call+0x449/0x72c
    kworker/1:2-1810 [001] ...1 164.770829: afs_call: c=00000002 PUT u=2 o=12 sp=afs_process_async_call+0xdb/0x11e
    kworker/1:2-1810 [001] ...2 164.771084: rxrpc_abort: c=00000002 95786a88:00000008 s=0 a=1 e=1 K-1
    kworker/1:2-1810 [001] ...1 164.771461: rxrpc_tx_packet: c=00000002 e9f1a7a8:95786a88:00000008:09c5 00000002 00000000 04 00 ABORT CallAbort
    kworker/1:2-1810 [001] ...1 164.771466: afs_call: c=00000002 PUT u=1 o=12 sp=SRXAFSCB_ProbeUuid+0xc1/0x106

The abort generated in SRXAFSCB_ProbeUuid(), labelled "K-1", indicates that the local filesystem/cache manager didn't recognise the UUID as its own.

Fixes: 2067b2b3f484 ("afs: Fix the CB.ProbeUuid service handler to reply correctly")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13  afs: Fix some tracing details  (David Howells)

Fix a couple of tracelines to indicate the usage count after the atomic op, not the usage count before it, to be consistent with other afs and rxrpc trace lines.

Change the wording of the afs_call_trace_work trace ID label from "WORK" to "QUEUE" to reflect the fact that it's queueing work, not doing work.

Fixes: 341f741f04be ("afs: Refcount the afs_call struct")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13  rxrpc: Fix call interruptibility handling  (David Howells)

Fix the interruptibility of kernel-initiated client calls so that they're either only interruptible when they're waiting for a call slot to come available or they're not interruptible at all. Either way, they're not interruptible during transmission. This should help prevent StoreData calls from being interrupted when writeback is in progress. It doesn't, however, handle interruption during the receive phase.

Userspace-initiated calls are still interruptible. After the signal has been handled, sendmsg() will return the amount of data copied out of the buffer and userspace can perform another sendmsg() call to continue transmission.

Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells <dhowells@redhat.com>
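The resulting API can be sketched as a three-state interruptibility mode passed in by the kernel caller; the identifier names below are assumptions, check include/net/af_rxrpc.h for the real definition:

    enum rxrpc_interruptibility {
            RXRPC_INTERRUPTIBLE,    /* userspace-style: all waits interruptible */
            RXRPC_PREINTERRUPTIBLE, /* interruptible only while waiting for a call slot */
            RXRPC_UNINTERRUPTIBLE,  /* writeback-style: never interruptible */
    };
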
2020-03-13  Merge tag 'nfs-for-5.6-3' of git://git.linux-nfs.org/projects/anna/linux-nfs  (Linus Torvalds)

Pull NFS client bugfixes from Anna Schumaker:
 "These are mostly fscontext fixes, but there is also one that fixes collisions seen in fscache:

  - Ensure the fs_context has the correct fs_type when mounting and submounting

  - Fix leaking of ctx->nfs_server.hostname

  - Add minor version to fscache key to prevent collisions"

* tag 'nfs-for-5.6-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  nfs: add minor version to nfs_server_key for fscache
  NFS: Fix leak of ctx->nfs_server.hostname
  NFS: Don't hard-code the fs_type when submounting
  NFS: Ensure the fs_context has the correct fs_type before mounting
2020-03-13  Merge tag 'fuse-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse  (Linus Torvalds)

Pull fuse fix from Miklos Szeredi:
 "Fix an Oops introduced in v5.4"

* tag 'fuse-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix stack use after return
2020-03-13  Merge tag 'ovl-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs  (Linus Torvalds)

Pull overlayfs fixes from Miklos Szeredi:
 "Fix three bugs introduced in this cycle"

* tag 'ovl-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix lockdep warning for async write
  ovl: fix some xino configurations
  ovl: fix lock in ovl_llseek()
2020-03-13  btrfs: fix log context list corruption after rename whiteout error  (Filipe Manana)

During a rename whiteout, if btrfs_whiteout_for_rename() returns an error we can end up returning from btrfs_rename() with the log context object still in the root's log context list - this happens if 'sync_log' was set to true before we called btrfs_whiteout_for_rename() and it is dangerous because we end up with a corrupt linked list (root->log_ctxs) as the log context object was allocated on the stack.

After btrfs_rename() returns, any task that is running btrfs_sync_log() concurrently can end up crashing because that linked list is traversed by btrfs_sync_log() (through btrfs_remove_all_log_ctxs()). That results in the same issue that commit e6c617102c7e4 ("Btrfs: fix log context list corruption after rename exchange operation") fixed.

Fixes: d4682ba03ef618 ("Btrfs: sync log after logging new name")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-13  Merge tag 'io_uring-5.6-2020-03-13' of git://git.kernel.dk/linux-block  (Linus Torvalds)

Pull io_uring fix from Jens Axboe:
 "Just a single fix here, improving the RCU callback ordering from last week. After a bit more perusing by Paul, he poked a hole in the original"

* tag 'io_uring-5.6-2020-03-13' of git://git.kernel.dk/linux-block:
  io_uring: ensure RCU callback ordering with rcu_barrier()
2020-03-13  afs: Use kfree_rcu() instead of casting kfree() to rcu_callback_t  (Jann Horn)

afs_put_addrlist() casts kfree() to rcu_callback_t. Apart from being wrong in theory, this might also blow up when people start enforcing function types via compiler instrumentation, and it means the rcu_head has to be first in struct afs_addr_list.

Use kfree_rcu() instead, it's simpler and more correct.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
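The before/after pattern, as a minimal sketch (struct and field names illustrative):

    struct addr_list {
            struct rcu_head rcu;    /* with kfree_rcu() this need not be first */
            int nr_addrs;
    };

    /* before: type-punned callback, rcu_head forced to offset 0 */
    call_rcu(&alist->rcu, (rcu_callback_t)kfree);

    /* after: kfree_rcu() takes the rcu_head field by name and
     * computes the offset of the enclosing object itself */
    kfree_rcu(alist, rcu);
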
2020-03-13  ovl: fix lockdep warning for async write  (Miklos Szeredi)

Lockdep reports "WARNING: lock held when returning to user space!" due to an async write holding the freeze lock over the write. Apparently aio.c already deals with this by lying to lockdep about the state of the lock. Do the same here.

No need to check for S_IFREG() here since these file ops are regular-only.

Reported-by: syzbot+9331a354f4f624a52a55@syzkaller.appspotmail.com
Fixes: 2406a307ac7d ("ovl: implement async IO routines")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-03-13  ovl: fix some xino configurations  (Amir Goldstein)

Fix up two bugs in the conversion to xino_mode:

1. xino=off does not always end up in disabled mode
2. xino=auto on 32bit arch should end up in disabled mode

Take a proactive approach to disabling xino on 32bit kernels:

1. Disable the XINO_AUTO config at build time
2. Disable xino with a warning at mount time

As a by-product, xino=on on a 32bit arch also ends up in disabled mode. We never intended to enable xino on 32bit archs, and this will make the rest of the logic simpler.

Fixes: 0f831ec85eda ("ovl: simplify ovl_same_sb() helper")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-03-12  Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  (Linus Torvalds)

Pull vfs fixes from Al Viro:
 "A couple of fixes for old crap in ->atomic_open() instances"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  cifs_atomic_open(): fix double-put on late allocation failure
  gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache
2020-03-12cifs_atomic_open(): fix double-put on late allocation failureAl Viro
several iterations of ->atomic_open() calling conventions ago, we used to need fput() if ->atomic_open() failed at some point after successful finish_open(). Now (since 2016) it's not needed - struct file carries enough state to make fput() work regardless of the point in struct file lifecycle and discarding it on failure exits in open() got unified. Unfortunately, I'd missed the fact that we had an instance of ->atomic_open() (cifs one) that used to need that fput(), as well as the stale comment in finish_open() demanding such late failure handling. Trivially fixed... Fixes: fe9ec8291fca "do_last(): take fput() on error after opening to out:" Cc: stable@kernel.org # v4.7+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-12gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcacheAl Viro
with the way fs/namei.c:do_last() had been done, ->atomic_open() instances needed to recognize the case when existing file got found with O_EXCL|O_CREAT, either by falling back to finish_no_open() or failing themselves. gfs2 one didn't. Fixes: 6d4ade986f9c (GFS2: Add atomic_open support) Cc: stable@kernel.org # v3.11 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
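The rule being enforced, sketched in code - this is the generic ->atomic_open() contract described above, not the gfs2 diff itself:

    /* inside an ->atomic_open() instance, after the internal lookup:
     * an existing positive dentry under O_CREAT|O_EXCL must not be
     * opened; fail (or hand it back via finish_no_open()) instead */
    if (d_really_is_positive(dentry) &&
        (flags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL))
            return -EEXIST;
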
2020-03-12  ovl: fix lock in ovl_llseek()  (Amir Goldstein)

ovl_inode_lock() is interruptible. When inode_lock() in ovl_llseek() was replaced with ovl_inode_lock(), we did not add a check for error. Fix this by making ovl_inode_lock() uninterruptible and changing the existing call sites to use an _interruptible variant.

Reported-by: syzbot+66a9752fa927f745385e@syzkaller.appspotmail.com
Fixes: b1f9d3858f72 ("ovl: use ovl_inode_lock in ovl_llseek()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
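A sketch of the resulting helper pair, assuming overlayfs's per-inode mutex at OVL_I(inode)->lock (see fs/overlayfs/ for the real code):

    static inline void ovl_inode_lock(struct inode *inode)
    {
            mutex_lock(&OVL_I(inode)->lock);        /* cannot fail */
    }

    static inline int ovl_inode_lock_interruptible(struct inode *inode)
    {
            /* returns -EINTR if a signal arrives; callers must check */
            return mutex_lock_interruptible(&OVL_I(inode)->lock);
    }
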
2020-03-12  io-wq: remove duplicated cancel code  (Pavel Begunkov)

Deduplicate the cancellation parts, as many of them look the same, e.g.:

- io_wqe_cancel_cb_work() and io_wqe_cancel_work()
- io_wq_worker_cancel() and io_work_cancel()

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-11  Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt  (Linus Torvalds)

Pull fscrypt fix from Eric Biggers:
 "Fix a bug where, if userspace is writing to encrypted files while the FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (introduced in v5.4) is running, dirty inodes could be evicted, causing writes to be lost or the filesystem to hang due to a use-after-free.

  This was encountered during real-world use, not just theoretical. Tested with the existing fscrypt xfstests, and with a new xfstest I wrote to reproduce this bug.

  This fix does expose an existing bug with '-o lazytime' that Ted is working on fixing, but this fix is more critical and needed anyway regardless of the lazytime fix"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: don't evict dirty inodes after removing key
2020-03-11  io_uring: fix truncated async read/readv and write/writev retry  (Jens Axboe)

Ensure we keep the truncated value, if we did truncate it. If not, we might read/write more than the registered buffer size. Also for retry, ensure that we return the truncated mapped value for the vectorized versions of the read/write commands.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-11  io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled  (Xiaoguang Wang)

When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need to do io completion events polling again; they can rely on io_sq_thread to do the polling work, which can reduce cpu usage and uring_lock contention.

I modify fio io_uring engine codes a bit to evaluate the performance:

    static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
                    continue;
            }

    -       if (!o->sqpoll_thread) {
    +       if (o->sqpoll_thread && o->hipri) {
                    r = io_uring_enter(ld, 0, actual_min,
                                       IORING_ENTER_GETEVENTS);
                    if (r < 0) {

and use "fio -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread -rw=read -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k -size=10G -numjobs=1 -time_based -runtime=120"

    original codes
    --------------------------------------------------------------------
    iodepth       |        4 |        8 |       16 |       32 |       64
    bw            | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s
    fio cpu usage |     100% |     100% |     100% |     100% |     100%
    --------------------------------------------------------------------

    with patch
    --------------------------------------------------------------------
    iodepth       |        4 |        8 |       16 |       32 |       64
    bw            | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s
    fio cpu usage |    63.8% |    74.4% |    81.1% |    83.7% |    82.4%
    --------------------------------------------------------------------
    bw improve    |     5.5% |    13.2% |    12.3% |     9.8% |    11.5%
    --------------------------------------------------------------------

From the above test results, we can see that bw has a 5.5%~13% improvement, and the fio process's cpu usage also drops a lot. Note this won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL are both enabled; in this case, io_sq_thread always has 100% cpu usage. I think this patch will be friendly to applications which will often use io_uring_wait_cqe() or similar from liburing.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
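For reference, a hedged liburing sketch of setting up a ring in this combined mode, where the kernel sq thread handles both submission and completion polling:

    #include <liburing.h>

    int setup_polled_ring(struct io_uring *ring)
    {
            struct io_uring_params p = { };

            p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL;
            p.sq_thread_idle = 2000;        /* ms the sq thread spins before sleeping */
            /* the application can then harvest CQEs from the ring without
             * an io_uring_enter(IORING_ENTER_GETEVENTS) round trip */
            return io_uring_queue_init_params(64, ring, &p);
    }
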
2020-03-10  io_uring: Fix unused function warnings  (YueHaibing)

If CONFIG_NET is not set, gcc warns:

    fs/io_uring.c:3110:12: warning: io_setup_async_msg defined but not used [-Wunused-function]
     static int io_setup_async_msg(struct io_kiocb *req,
                ^~~~~~~~~~~~~~~~~~

There are many functions wrapped by CONFIG_NET; move them together to simplify the code, which also fixes this warning.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
[Minor tweaks.]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10  io_uring: add end-of-bits marker and build time verify it  (Jens Axboe)

Not easy to tell if we're going over the size of bits we can shove in req->flags, so add an end-of-bits marker and a BUILD_BUG_ON() check for it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
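The pattern looks roughly like this (enum names hypothetical):

    enum {
            REQ_F_FIXED_FILE_BIT,
            REQ_F_IO_DRAIN_BIT,
            /* ... one entry per flag ... */
            __REQ_F_LAST_BIT,       /* end-of-bits marker, not a real flag */
    };

    /* in an init function: refuse to build once the flags outgrow the field */
    BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof(u32));
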
2020-03-10  io_uring: provide means of removing buffers  (Jens Axboe)

We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers is to trigger IO on them. The usual case of shrinking a buffer pool would be to just not replenish the buffers when IO completes, and instead just free it. But it may be nice to have a way to manually remove a number of buffers from a given group, and IORING_OP_REMOVE_BUFFERS provides that functionality.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
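A hedged usage sketch via liburing's io_uring_prep_remove_buffers() helper, removing 16 buffers from group 7:

    #include <liburing.h>

    static int shrink_group(struct io_uring *ring)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            io_uring_prep_remove_buffers(sqe, 16, 7);  /* (sqe, nr, buf group id) */
            /* on completion, cqe->res holds the number of buffers actually
             * removed (possibly fewer than asked for), or a negative error */
            return io_uring_submit(ring);
    }
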
2020-03-10  io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG  (Jens Axboe)

Like IORING_OP_READV, this is limited to supporting just a single segment in the iovec passed in.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10  io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_READV  (Jens Axboe)

This adds support for the vectored read. This is limited to supporting just 1 segment in the iov, and is provided just for convenience for applications that use IORING_OP_READV already.

The iov helpers will be used for IORING_OP_RECVMSG as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10  io_uring: support buffer selection for OP_READ and OP_RECV  (Jens Axboe)

If a server process has tons of pending socket connections, generally it uses epoll to wait for activity. When the socket is ready for reading (or writing), the task can select a buffer and issue a recv/send on the given fd.

Now that we have fast (non-async thread) support, a task can have tons of reads or writes pending. But that means they need buffers to back that data, and if the number of connections is high enough, having them preallocated for all possible connections is unfeasible.

With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to use for any request. The request then sets IOSQE_BUFFER_SELECT in the sqe, and a given group ID in sqe->buf_group. When the fd becomes ready, a free buffer from the specified group is selected. If none are available, the request is terminated with -ENOBUFS. If successful, the CQE on completion will contain the buffer ID chosen in the cqe->flags member, encoded as:

    (buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;

Once a buffer has been consumed by a request, it is no longer available and must be registered again with IORING_OP_PROVIDE_BUFFERS.

Requests need to support this feature. For now, IORING_OP_READ and IORING_OP_RECV support it. This is checked on SQE submission; a CQE with res == -EOPNOTSUPP will be posted if attempted on unsupported requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
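Putting the two halves together, a hedged liburing sketch: provide a group of buffers, issue a recv with no buffer of its own, then decode which buffer the kernel picked from the CQE:

    #include <liburing.h>

    #define GROUP_ID 1
    #define NR_BUFS  8
    #define BUF_SIZE 4096

    static char bufs[NR_BUFS][BUF_SIZE];

    static void recv_with_selected_buffer(struct io_uring *ring, int sockfd)
    {
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;

            /* hand NR_BUFS buffers to the kernel under GROUP_ID, IDs from 0 */
            sqe = io_uring_get_sqe(ring);
            io_uring_prep_provide_buffers(sqe, bufs, BUF_SIZE, NR_BUFS, GROUP_ID, 0);
            io_uring_submit(ring);
            io_uring_wait_cqe(ring, &cqe);
            io_uring_cqe_seen(ring, cqe);

            /* recv with addr == NULL: the kernel selects a buffer at fd-ready time */
            sqe = io_uring_get_sqe(ring);
            io_uring_prep_recv(sqe, sockfd, NULL, BUF_SIZE, 0);
            sqe->flags |= IOSQE_BUFFER_SELECT;
            sqe->buf_group = GROUP_ID;
            io_uring_submit(ring);

            io_uring_wait_cqe(ring, &cqe);
            if (cqe->res >= 0 && (cqe->flags & IORING_CQE_F_BUFFER)) {
                    unsigned id = cqe->flags >> IORING_CQE_BUFFER_SHIFT;

                    (void)id;       /* bufs[id] now holds cqe->res bytes; the
                                     * buffer is consumed and must be
                                     * re-provided before it can be used again */
            }
            io_uring_cqe_seen(ring, cqe);
    }
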
2020-03-10  io_uring: add IORING_OP_PROVIDE_BUFFERS  (Jens Axboe)

IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to support passing in an addr/len that is associated with a buffer ID and buffer group ID. The group ID is used to index and look up the buffers, while the buffer ID can be used to notify the application which buffer in the group was used. The addr passed in is the starting buffer address, and length is each buffer's length.

The number of buffers to add can be specified, in which case addr is incremented by length for each addition, and each buffer increments the buffer ID specified.

No validation is done of the buffer ID. If the application provides buffers within the same group with identical buffer IDs, then it'll have a hard time telling which buffer ID was used. The only restriction is that the buffer ID can be a max of 16 bits in size, so USHRT_MAX is the maximum ID that can be used.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
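The addr/len/ID stepping described above means a single N-buffer PROVIDE_BUFFERS submission behaves like this loop of single-buffer ones (a hedged equivalence sketch using liburing):

    #include <liburing.h>

    static void provide_one_by_one(struct io_uring *ring, char *base,
                                   unsigned len, unsigned nr,
                                   int bgid, unsigned first_bid)
    {
            unsigned i;

            for (i = 0; i < nr; i++) {
                    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                    /* addr steps by len, buffer ID steps by 1 */
                    io_uring_prep_provide_buffers(sqe, base + i * len, len,
                                                  1, bgid, first_bid + i);
            }
            io_uring_submit(ring);
    }
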
2020-03-09  pstore/ram: Replace zero-length array with flexible-array member  (Gustavo A. R. Silva)

The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99:

    struct foo {
            int stuff;
            struct boo array[];
    };

By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kinds of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by this change:

"Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20200309202327.GA8813@embeddedor
Signed-off-by: Kees Cook <keescook@chromium.org>
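As a companion, allocation of such a struct is typically done with the struct_size() helper from <linux/overflow.h>, which performs the size arithmetic with overflow checking; a minimal sketch:

    #include <linux/overflow.h>
    #include <linux/slab.h>

    struct boo { int x; };

    struct foo {
            int stuff;
            struct boo array[];     /* flexible array member, must be last */
    };

    static struct foo *alloc_foo(size_t n)
    {
            struct foo *p;

            /* sizeof(*p) + n * sizeof(p->array[0]), overflow-checked */
            p = kmalloc(struct_size(p, array, n), GFP_KERNEL);
            return p;
    }
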
2020-03-09  Merge 5.6-rc5 into driver-core-next  (Greg Kroah-Hartman)

We need the driver core and debugfs changes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-08  io_uring: ensure RCU callback ordering with rcu_barrier()  (Jens Axboe)

After more careful studying, Paul informs me that we cannot rely on ordering of RCU callbacks in the way that the tagged commit did. The current construct looks like this:

    void C(struct rcu_head *rhp)
    {
            do_something(rhp);
            call_rcu(&p->rh, B);
    }

    call_rcu(&p->rh, A);
    call_rcu(&p->rh, C);

and we're relying on ordering between A and B, which isn't guaranteed. Make this explicit instead, and have a work item issue the rcu_barrier() to ensure that A has run before we manually execute B.

While thorough testing never showed this issue, it's dependent on the per-cpu load in terms of RCU callbacks. The updated method simplifies the code as well, and eliminates the need to maintain an rcu_head in the fileset data.

Fixes: c1e2148f8ecb ("io_uring: free fixed_file_data after RCU grace period")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
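The replacement pattern, as a hedged sketch with illustrative names: run B from a work item that first waits out every callback already queued, instead of chaining call_rcu() and hoping for order:

    struct fileset_data {
            struct work_struct free_work;
            /* ... */
    };

    static void fileset_free_work(struct work_struct *work)
    {
            struct fileset_data *data =
                    container_of(work, struct fileset_data, free_work);

            rcu_barrier();  /* every callback queued so far, incl. A, has run */
            do_free(data);  /* the work formerly done by chained callback B
                             * (do_free() is a stand-in for the real teardown) */
    }

    /* from callback A's context, instead of call_rcu(&p->rh, B): */
    INIT_WORK(&data->free_work, fileset_free_work);
    queue_work(system_wq, &data->free_work);

The work item matters here: rcu_barrier() sleeps, so it cannot be issued from the RCU callback itself.
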
2020-03-08  Merge tag 'driver-core-5.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core  (Linus Torvalds)

Pull driver core and debugfs fixes from Greg KH:
 "Here are four small driver core / debugfs patches for 5.6-rc3:

  - debugfs api cleanup now that all debugfs_create_regset32() callers have been fixed up. This was waiting until after the -rc1 merge as these fixes came in through different trees

  - driver core sync state fixes based on reports of minor issues found in the feature

  All of these have been in linux-next with no reported issues"

* tag 'driver-core-5.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver core: Skip unnecessary work when device doesn't have sync_state()
  driver core: Add dev_has_sync_state()
  driver core: Call sync_state() even if supplier has no consumers
  debugfs: remove return value of debugfs_create_regset32()
2020-03-08  Merge branch 'efi/urgent' into efi/core, to pick up fixes  (Ingo Molnar)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-07  fscrypt: don't evict dirty inodes after removing key  (Eric Biggers)

After FS_IOC_REMOVE_ENCRYPTION_KEY removes a key, it syncs the filesystem and tries to get and put all inodes that were unlocked by the key so that unused inodes get evicted via fscrypt_drop_inode(). Normally, the inodes are all clean due to the sync.

However, after the filesystem is sync'ed, userspace can modify and close one of the files. (Userspace is *supposed* to close the files before removing the key. But it doesn't always happen, and the kernel can't assume it.) This causes the inode to be dirtied and have i_count == 0. Then, fscrypt_drop_inode() fails to consider this case and indicates that the inode can be dropped, causing the write to be lost. On f2fs, other problems such as a filesystem freeze could occur due to the inode being freed while still on f2fs's dirty inode list.

Fix this bug by making fscrypt_drop_inode() only drop clean inodes. I've written an xfstest which detects this bug on ext4, f2fs, and ubifs.

Fixes: b1c0ec3599f4 ("fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl")
Cc: <stable@vger.kernel.org> # v5.4+
Link: https://lore.kernel.org/r/20200305084138.653498-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
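The guard this implies can be sketched as follows - an assumption about the shape of the check, not the verbatim patch:

    /* in fscrypt_drop_inode(): never offer a dirty inode for eviction */
    if (inode->i_state & I_DIRTY_ALL)
            return 0;       /* keep it; writeback still owns it */
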
2020-03-07  Merge tag 'io_uring-5.6-2020-03-07' of git://git.kernel.dk/linux-block  (Linus Torvalds)

Pull io_uring fixes from Jens Axboe:
 "Here are a few io_uring fixes that should go into this release. This contains:

  - Removal of (now) unused io_wq_flush() and associated flag (Pavel)

  - Fix cancelation lockup with linked timeouts (Pavel)

  - Fix for potential use-after-free when freeing percpu ref for fixed file sets

  - io-wq cancelation fixups (Pavel)"

* tag 'io_uring-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
  io_uring: fix lockup with timeouts
  io_uring: free fixed_file_data after RCU grace period
  io-wq: remove io_wq_flush and IO_WQ_WORK_INTERNAL
  io-wq: fix IO_WQ_WORK_NO_CANCEL cancellation
2020-03-07  io_uring: fix lockup with timeouts  (Pavel Begunkov)

There is a recipe to deadlock the kernel: submit a timeout sqe with a linked_timeout (e.g. test_single_link_timeout_ception() from liburing), and SIGKILL the process. Then, io_kill_timeouts() takes @ctx->completion_lock, but the timeout isn't flagged with REQ_F_COMP_LOCKED, and will try to double grab it during io_put_free() to cancel the linked timeout. Probably, the same can happen with another io_kill_timeout() call site, that is io_commit_cqring().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-06  Merge tag 'for-5.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux  (Linus Torvalds)

Pull btrfs fix from David Sterba:
 "One fixup for DIO when in use with the new checksums, a missed case where the checksum size was still assuming u32"

* tag 'for-5.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix RAID direct I/O reads with alternate csums
2020-03-06  Merge tag 'filelock-v5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux  (Linus Torvalds)

Pull file locking fixes from Jeff Layton:
 "Just a couple of late-breaking patches for the file locking code. The second patch (from yangerkun) fixes a rather nasty looking potential use-after-free that should go to stable.

  The other patch could technically wait for 5.7, but it's fairly innocuous so I figured we might as well take it"

* tag 'filelock-v5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  locks: fix a potential use-after-free problem when wakeup a waiter
  fcntl: Distribute switch variables for initialization
2020-03-06  io_uring: free fixed_file_data after RCU grace period  (Jens Axboe)

The percpu refcount protects this structure, and we can have an atomic switch in progress when exiting. This makes it unsafe to just free the struct normally, and can trigger the following KASAN warning:

    BUG: KASAN: use-after-free in percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
    Read of size 1 at addr ffff888181a19a30 by task swapper/0/0

    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-rc4+ #5747
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    Call Trace:
     <IRQ>
     dump_stack+0x76/0xa0
     print_address_description.constprop.0+0x3b/0x60
     ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
     ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
     __kasan_report.cold+0x1a/0x3d
     ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
     percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
     rcu_core+0x370/0x830
     ? percpu_ref_exit+0x50/0x50
     ? rcu_note_context_switch+0x7b0/0x7b0
     ? run_rebalance_domains+0x11d/0x140
     __do_softirq+0x10a/0x3e9
     irq_exit+0xd5/0xe0
     smp_apic_timer_interrupt+0x86/0x200
     apic_timer_interrupt+0xf/0x20
     </IRQ>
    RIP: 0010:default_idle+0x26/0x1f0

Fix this by punting the final exit and free of the struct to RCU; then we know that it's safe to do so. Jann suggested the approach of using a double rcu callback to achieve this. It's important that we do a nested call_rcu() callback, as otherwise the free could be ordered before the atomic switch, even if the latter was already queued.

Reported-by: syzbot+e017e49c39ab484ac87a@syzkaller.appspotmail.com
Suggested-by: Jann Horn <jannh@google.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
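The double-callback trick, sketched with illustrative names: re-queuing the free from inside the first callback makes it run a full grace period after the percpu-ref switch callback, whereas two independent call_rcu() calls carry no such ordering. (Note the 2020-03-08 entry above, which later replaces this scheme with rcu_barrier().)

    struct fileset_data {
            struct rcu_head rcu;
            /* ... */
    };

    static void fileset_free_rcu(struct rcu_head *rhp)
    {
            kfree(container_of(rhp, struct fileset_data, rcu));
    }

    static void fileset_switch_rcu(struct rcu_head *rhp)
    {
            /* queued from inside a callback: runs one grace period later */
            call_rcu(rhp, fileset_free_rcu);
    }

    /* instead of a plain call_rcu(&data->rcu, fileset_free_rcu): */
    call_rcu(&data->rcu, fileset_switch_rcu);
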
2020-03-06  locks: fix a potential use-after-free problem when wakeup a waiter  (yangerkun)

Commit 16306a61d3b7 ("fs/locks: always delete_block after waiting.") added the logic to check waiter->fl_blocker without blocked_lock_lock. This can trigger a UAF when we try to wake up a waiter.

Thread 1 has created a write flock a on a file; now thread 2 tries to unlock and delete flock a while thread 3 tries to add flock b on the same file:

    Thread2                      Thread3
                                 flock syscall(create flock b)
                                 ...flock_lock_inode_wait
                                    flock_lock_inode(will insert
                                    our fl_blocked_member list
                                    to flock a's fl_blocked_requests)
                                    sleep
    flock syscall(unlock)
    ...flock_lock_inode_wait
       locks_delete_lock_ctx
       ...__locks_wake_up_blocks
          __locks_delete_blocks(
          b->fl_blocker = NULL)
          ...
                                 break by a signal
                                 locks_delete_block
                                  b->fl_blocker == NULL &&
                                  list_empty(&b->fl_blocked_requests)
                                  success, return directly
                                 locks_free_lock b
       wake_up(&b->fl_waiter)
       trigger UAF

Fix it by removing this logic; this patch may also fix CVE-2019-19769.

Cc: stable@vger.kernel.org
Fixes: 16306a61d3b7 ("fs/locks: always delete_block after waiting.")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2020-03-06  fat: fix uninit-memory access for partially initialized inode  (OGAWA Hirofumi)

When we get an error in the middle of reading an inode, some fields in the inode might still be uninitialized, and then the evict_inode path may access those fields via iput(). To fix this, make sure that the inode fields are initialized.

Reported-by: syzbot+9d82b8de2992579da5d0@syzkaller.appspotmail.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/871rqnreqx.fsf@mail.parknet.co.jp
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-06  futex: Fix inode life-time issue  (Peter Zijlstra)

As reported by Jann, ihold() does not in fact guarantee inode persistence. Instead of making it so, replace the usage of inode pointers with a per-boot, machine-wide, unique inode identifier.

This sequence number is global, but shared (file backed) futexes are rare enough that this should not become a performance issue.

Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
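A sketch of such a per-boot identifier (field and helper names are assumptions): lazily assign each inode a 64-bit number from a global counter the first time a shared futex key needs it:

    static u64 get_inode_sequence_number(struct inode *inode)
    {
            static atomic64_t i_seq;
            u64 old;

            /* fast path: already assigned */
            old = atomic64_read(&inode->i_sequence);
            if (likely(old))
                    return old;

            for (;;) {
                    u64 new = atomic64_add_return(1, &i_seq);

                    if (WARN_ON_ONCE(!new))
                            continue;       /* 0 means "unset"; skip it on wrap */

                    old = atomic64_cmpxchg_relaxed(&inode->i_sequence, 0, new);
                    if (old)
                            return old;     /* lost the race; use the winner's value */
                    return new;
            }
    }
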
2020-03-04  io_uring: buffer registration infrastructure  (Jens Axboe)

This just prepares the ring for having lists of buffers associated with it, that the application can provide for SQEs to consume instead of providing their own. The buffers are organized by group ID.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-04  io_uring/io-wq: forward submission ref to async  (Pavel Begunkov)

First, this changes the io-wq interfaces. It replaces {get,put}_work() with free_work(), which is guaranteed to be called exactly once. It also enforces the free_work() callback to be non-NULL.

io_uring follows the changes and, instead of putting a submission reference in io_put_req_async_completion(), it will be done in io_free_work(). As this removes io_get_work() with its corresponding refcount_inc(), the ref balance is maintained.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-04  io-wq: optimise out *next_work() double lock  (Pavel Begunkov)

When executing non-linked hashed work, io_worker_handle_work() will lock-unlock wqe->lock to update hash, and then immediately lock-unlock to get next work. Optimise this case and do lock/unlock only once.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-04  io-wq: optimise locking in io_worker_handle_work()  (Pavel Begunkov)

There are 2 optimisations:

- Now, io_worker_handle_work() does io_assign_current_work() twice per request, and each one adds a lock/unlock(worker->lock) pair. The first is to reset worker->cur_work to NULL, and the second to set a real work shortly after. If there is a dependent work, set it immediately, which effectively removes the extra NULL'ing.

- And there is no use in taking wqe->lock for linked works, as they are not hashed now. Optimise it out.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-04  io-wq: shuffle io_worker_handle_work() code  (Pavel Begunkov)

This is a preparation patch; it adds some helpers and makes the next patches cleaner:

- extract io_impersonate_work() and io_assign_current_work()
- replace the @next label with a nested do-while
- move put_work() right after NULL'ing cur_work

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-03  io_uring: get next work with submission ref drop  (Pavel Begunkov)

If after dropping the submission reference req->refs == 1, the request is done, because this one is for io_put_work() and will be dropped synchronously shortly after. In this case it's safe to steal a next work from the request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-03  io_uring: remove @nxt from handlers  (Pavel Begunkov)

There will be no use for @nxt in the handlers, and it doesn't work anyway, so purge it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-03  io_uring: make submission ref putting consistent  (Pavel Begunkov)

The rule is simple: any async handler gets a submission ref and should put it at the end. Make them all follow it, for consistency.

This is a preparation patch, and as io_wq_assign_next() currently won't ever work, this doesn't care to use io_put_req_find_next() instead of io_put_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[refcount_inc_not_zero() -> refcount_inc() fix.]
Signed-off-by: Jens Axboe <axboe@kernel.dk>