path: root/fs
2020-06-23  cifs: misc: Use array_size() in if-statement controlling expression  (Gustavo A. R. Silva)

Use array_size() instead of the open-coded version in the controlling
expression of the if statement. Also, while there, use the preferred form
for passing the size of a struct. The alternative form, where the struct
name is spelled out, hurts readability and introduces an opportunity for
a bug when the pointer variable type is changed but the corresponding
sizeof that is passed as an argument is not.

This issue was found with the help of Coccinelle and audited and fixed
manually.

Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
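A sketch of the two idioms the patch prefers (illustrative only; the
struct, names, and bound are hypothetical, not the actual cifs code):

    #include <linux/overflow.h>
    #include <linux/slab.h>

    struct foo { int a; };

    static struct foo *alloc_foos(size_t num)
    {
        struct foo *items;

        /* array_size() saturates on multiply overflow instead of wrapping */
        if (array_size(num, sizeof(*items)) > PAGE_SIZE)
            return NULL;

        /* sizeof(*items), not sizeof(struct foo): stays correct even if
         * the pointer's type is later changed */
        items = kmalloc_array(num, sizeof(*items), GFP_KERNEL);
        return items;
    }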
2020-06-23  cifs: update ctime and mtime during truncate  (Zhang Xiaoxu)

As the truncate man page describes, if the size changed, then the
st_ctime and st_mtime fields should be updated. But cifs doesn't do
this, which makes xfstests generic/313 fail. So add the
ATTR_MTIME|ATTR_CTIME flags to attrs when changing the file size.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
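A minimal sketch of the idea (simplified, not the exact cifs diff; the
helper name is hypothetical):

    /* When a setattr changes the size, also mark the times dirty so
     * the timestamps are updated along with i_size. */
    static void mark_truncate_attrs(struct iattr *attrs)
    {
        if (attrs->ia_valid & ATTR_SIZE)
            attrs->ia_valid |= ATTR_MTIME | ATTR_CTIME;
    }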
2020-06-23  cifs/smb3: Fix data inconsistent when punch hole  (Zhang Xiaoxu)

When a punch hole succeeds, we can still read the old data from the file:

    # strace -e trace=pread64,fallocate xfs_io -f -c "pread 20 40" \
          -c "fpunch 20 40" -c "pread 20 40" file
    pread64(3, " version 5.8.0-rc1+"..., 40, 20) = 40
    fallocate(3, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 20, 40) = 0
    pread64(3, " version 5.8.0-rc1+"..., 40, 20) = 40

CIFS implements fallocate(FALLOC_FL_PUNCH_HOLE) by sending an SMB
ioctl(FSCTL_SET_ZERO_DATA) to the server. That sets the range of the
remote file to zero, but the local page cache is not updated, so it
becomes inconsistent with the server. This can also be found by
xfstests generic/316. So we need to remove the page cache for the range
before sending the SMB ioctl(FSCTL_SET_ZERO_DATA) to the server.

Fixes: 31742c5a33176 ("enable fallocate punch hole ("fallocate -p") for SMB3")
Suggested-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Cc: stable@vger.kernel.org # v3.17
Signed-off-by: Steve French <stfrench@microsoft.com>
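The shape of the fix, sketched under simplifying assumptions (the real
cifs code also flushes dirty pages first; names and context trimmed):

    /* Drop cached pages over the punched range before asking the
     * server to zero it, so later reads refetch the zeroed data. */
    static void drop_punched_range(struct inode *inode, loff_t off, loff_t len)
    {
        /* truncate_pagecache_range() takes an inclusive end offset */
        truncate_pagecache_range(inode, off, off + len - 1);
        /* ...then send SMB ioctl(FSCTL_SET_ZERO_DATA) to the server */
    }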
2020-06-23  cifs/smb3: Fix data inconsistent when zero file range  (Zhang Xiaoxu)

CIFS implements fallocate(FALLOC_FL_ZERO_RANGE) by sending an SMB
ioctl(FSCTL_SET_ZERO_DATA) to the server. That sets the range of the
remote file to zero, but the local page cache is not updated, so the
data becomes inconsistent with the server, which makes xfstests
generic/008 fail. So we need to remove the local page cache before
sending the SMB ioctl(FSCTL_SET_ZERO_DATA) to the server - the same
cache invalidation as in the punch-hole fix above. The next read will
re-cache the data.

Fixes: 30175628bf7f5 ("[SMB3] Enable fallocate -z support for SMB3 mounts")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Cc: stable@vger.kernel.org # v3.17
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-23  io_uring: fix io_sq_thread no schedule when busy  (Xuan Zhuo)

When the user consumes and generates sqes at a fast rate,
io_sqring_entries() can always find an sqe, and ret never equals
-EBUSY, so io_sq_thread() never calls cond_resched() or schedule(),
and then we get the following system error prompt:

    rcu: INFO: rcu_sched self-detected stall on CPU
or
    watchdog: BUG: soft lockup - CPU#23 stuck for 112s! [io_uring-sq:1863]

This patch checks whether cond_resched() needs to be called by calling
need_resched() every cycle.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
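A sketch of the loop pattern (heavily simplified; the helpers are
hypothetical stand-ins, not io_uring's actual functions):

    /* Yield the CPU when the scheduler asks for it, even if the
     * submission ring never drains. */
    while (!kthread_should_stop()) {
        unsigned int to_submit = sqring_entries(ctx);

        if (to_submit)
            submit_sqes(ctx, to_submit);

        if (need_resched())
            cond_resched();
    }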
2020-06-23  cifs: Fix double add page to memcg when cifs_readpages  (Zhang Xiaoxu)

When running xfstests generic/451, there is a BUG at mm/memcontrol.c:

    page:ffffea000560f2c0 refcount:2 mapcount:0 mapping:000000008544e0ea index:0xf
    mapping->aops:cifs_addr_ops dentry name:"tst-aio-dio-cycle-write.451"
    flags: 0x2fffff80000001(locked)
    raw: 002fffff80000001 ffffc90002023c50 ffffea0005280088 ffff88815cda0210
    raw: 000000000000000f 0000000000000000 00000002ffffffff ffff88817287d000
    page dumped because: VM_BUG_ON_PAGE(page->mem_cgroup)
    page->mem_cgroup:ffff88817287d000
    ------------[ cut here ]------------
    kernel BUG at mm/memcontrol.c:2659!
    invalid opcode: 0000 [#1] SMP
    CPU: 2 PID: 2038 Comm: xfs_io Not tainted 5.8.0-rc1 #44
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.4
    RIP: 0010:commit_charge+0x35/0x50
    Code: 0d 48 83 05 54 b2 02 05 01 48 89 77 38 c3 48 c7 c6 78 4a ea ba 48 83 05 38 b2 02 05 01 e8 63 0d9
    RSP: 0018:ffffc90002023a50 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: ffff88817287d000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff88817ac97ea0 RDI: ffff88817ac97ea0
    RBP: ffffea000560f2c0 R08: 0000000000000203 R09: 0000000000000005
    R10: 0000000000000030 R11: ffffc900020237a8 R12: 0000000000000000
    R13: 0000000000000001 R14: 0000000000000001 R15: ffff88815a1272c0
    FS:  00007f5071ab0800(0000) GS:ffff88817ac80000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055efcd5ca000 CR3: 000000015d312000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     mem_cgroup_charge+0x166/0x4f0
     __add_to_page_cache_locked+0x4a9/0x710
     add_to_page_cache_locked+0x15/0x20
     cifs_readpages+0x217/0x1270
     read_pages+0x29a/0x670
     page_cache_readahead_unbounded+0x24f/0x390
     __do_page_cache_readahead+0x3f/0x60
     ondemand_readahead+0x1f1/0x470
     page_cache_async_readahead+0x14c/0x170
     generic_file_buffered_read+0x5df/0x1100
     generic_file_read_iter+0x10c/0x1d0
     cifs_strict_readv+0x139/0x170
     new_sync_read+0x164/0x250
     __vfs_read+0x39/0x60
     vfs_read+0xb5/0x1e0
     ksys_pread64+0x85/0xf0
     __x64_sys_pread64+0x22/0x30
     do_syscall_64+0x69/0x150
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f5071fcb1af
    Code: Bad RIP value.
    RSP: 002b:00007ffde2cdb8e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
    RAX: ffffffffffffffda RBX: 00007ffde2cdb990 RCX: 00007f5071fcb1af
    RDX: 0000000000001000 RSI: 000055efcd5ca000 RDI: 0000000000000003
    RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000001000 R11: 0000000000000293 R12: 0000000000000001
    R13: 000000000009f000 R14: 0000000000000000 R15: 0000000000001000
    Modules linked in:
    ---[ end trace 725fa14a3e1af65c ]---

Since commit 3fea5a499d57 ("mm: memcontrol: convert page cache to a new
mem_cgroup_charge() API") does not cancel the page charge, a page may
be added to the page cache twice:

    thread1                                   | thread2
    cifs_readpages
     readpages_get_pages
      add_to_page_cache_locked(head, index=n) = 0
                                              | readpages_get_pages
                                              |  add_to_page_cache_locked(head, index=n+1) = 0
      add_to_page_cache_locked(head, index=n+1) = -EEXIST

The next loop then runs with the list head page's index=n+1 and a
non-NULL page->mapping:

    readpages_get_pages
     add_to_page_cache_locked(head, index=n+1)
      commit_charge
       VM_BUG_ON_PAGE

So we should not continue the loop after any page fails to be added to
the page cache.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-06-23  Merge tag 'for-5.8-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux  (Linus Torvalds)

Pull btrfs fixes from David Sterba:
 "A number of fixes, located in two areas, one performance fix and one
  fixup for better integration with another patchset.

   - bug fixes in nowait aio:
       - fix snapshot creation hang after nowait-aio was used
       - fix failure to write to prealloc extent past EOF
       - don't block when extent range is locked

   - block group fixes:
       - relocation failure when scrub runs in parallel
       - refcount fix when removing fails
       - fix race between removal and creation
       - space accounting fixes

   - reinstate fast path check for log tree at unlink time, fixes
     performance drop of up to 30% in REAIM

   - kzfree/kfree fixup to ease treewide patchset renaming kzfree"

* tag 'for-5.8-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: use kfree() in btrfs_ioctl_get_subvol_info()
  btrfs: fix RWF_NOWAIT writes blocking on extent locks and waiting for IO
  btrfs: fix RWF_NOWAIT write not failling when we need to cow
  btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eof
  btrfs: fix hang on snapshot creation after RWF_NOWAIT write
  btrfs: check if a log root exists before locking the log_mutex on unlink
  btrfs: fix bytes_may_use underflow when running balance and scrub in parallel
  btrfs: fix data block group relocation failure due to concurrent scrub
  btrfs: fix race between block group removal and block group creation
  btrfs: fix a block group ref counter leak after failure to remove block group
2020-06-22  xfs: fix use-after-free on CIL context on shutdown  (Dave Chinner)

xlog_wait() on the CIL context can reference a freed context if the
waiter doesn't get scheduled before the CIL context is freed. This can
happen when a task is on the hard throttle and the CIL push aborts due
to a shutdown. This was detected by generic/019:

    thread 1                        thread 2

    __xfs_trans_commit
     xfs_log_commit_cil
      <CIL size over hard throttle limit>
      xlog_wait
       schedule
                                    xlog_cil_push_work
                                     wake_up_all
                                     <shutdown aborts commit>
                                     xlog_cil_committed
                                      kmem_free
       remove_wait_queue
        spin_lock_irqsave --> UAF

Fix it by moving the wait queue to the CIL rather than keeping it in
the CIL context that gets freed on push completion. Because the wait
queue is now independent of the CIL context and we might have multiple
contexts in flight at once, only wake the waiters on the push throttle
when the context we are pushing is over the hard throttle size
threshold.

Fixes: 0e7ab7efe7745 ("xfs: Throttle commits on delayed background CIL push")
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
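A sketch of the structural change (names simplified; the throttle-limit
macro is a hypothetical stand-in for the real XFS constant):

    /* The waitqueue lives in the long-lived CIL, not in the per-push
     * context that is freed when the push completes. */
    struct cil {
        wait_queue_head_t push_wait;
        /* ... */
    };

    static void cil_push_done(struct cil *cil, struct cil_ctx *ctx)
    {
        /* Only wake throttled committers when this context was the
         * one over the hard throttle limit. */
        if (ctx->space_used >= CIL_HARD_THROTTLE_LIMIT)
            wake_up_all(&cil->push_wait);
        kfree(ctx);    /* safe: no waitqueue memory lives in ctx */
    }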
2020-06-21  cifs: Fix cached_fid refcnt leak in open_shroot  (Xiyu Yang)

open_shroot() invokes kref_get(), which increases the refcount of the
"tcon->crfid" object. When open_shroot() returns non-zero, the open
operation failed and close_shroot() will not be called to decrement the
refcount of "tcon->crfid".

The reference counting issue happens in one normal path of
open_shroot(): when the cached root has been opened successfully in a
concurrent process, the function increases the refcount and jumps to
the "oshr_free" label to return. However, the current return value "rc"
may not equal 0, so the increased refcount will not be balanced outside
the function, causing a refcnt leak.

Fix this issue by setting the value of "rc" to 0 before jumping to the
"oshr_free" label.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
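The fixed path looks roughly like this (a sketch; the validity-check
helper is hypothetical and the fields are abbreviated from the cifs
code):

    if (cached_root_is_valid(tcon)) {
        /* another opener already cached the root: share it */
        kref_get(&tcon->crfid.refcount);
        rc = 0;    /* the fix: report success so the caller
                    * eventually drops the reference we just took */
        goto oshr_free;
    }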
2020-06-21  io_uring: kill NULL checks for submit state  (Pavel Begunkov)

After recent changes, io_submit_sqes() always passes a valid submit
state, so kill the leftover checks for it being NULL.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  io_uring: set @poll->file after @poll init  (Pavel Begunkov)

It's good practice to modify fields of a struct after it has been
initialised, not before. Even though io_init_poll_iocb() doesn't touch
poll->file, call it first.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  io_uring: remove REQ_F_MUST_PUNT  (Pavel Begunkov)

REQ_F_MUST_PUNT may seem clear, but it's the same as not having
REQ_F_NOWAIT set, which rather creates more confusion. Moreover, it
doesn't even affect any behaviour (e.g. see the patch removing it from
io_{read,write}). Kill the flag and update the already outdated
comments.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  io_uring: remove setting REQ_F_MUST_PUNT in rw  (Pavel Begunkov)

    io_{read,write}()
    {
        ...
    copy_iov: /* prep async */
        if (!(flags & REQ_F_NOWAIT) && !file_can_poll(file))
            flags |= REQ_F_MUST_PUNT;
    }

Setting REQ_F_MUST_PUNT there is pointless, because when it happens
REQ_F_NOWAIT is known to be _not_ set, and the request will go down the
async path in __io_queue_sqe() anyway. The file_can_poll() check is
also repeated in arm_poll*(), so it isn't needed here either. Remove
the mentioned assignment of REQ_F_MUST_PUNT in preparation for killing
the flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  io_uring: support true async buffered reads, if file provides it  (Jens Axboe)

If the file is flagged with FMODE_BUF_RASYNC, then we don't have to
punt the buffered read to an io-wq worker. Instead we can rely on page
unlocking callbacks to support retry-based async IO. This is a lot
more efficient than async thread offload.

The retry is done similarly to how we handle poll-based retry: from the
unlock callback, we simply queue the retry to a task_work-based
handler.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
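A hedged sketch of the decision (simplified, not io_uring's exact code;
the real path also wires up kiocb->ki_waitq before setting the flag):

    /* Retry inline via the page-unlock callback when supported,
     * otherwise fall back to punting to an io-wq worker. */
    static bool can_retry_buffered_read(struct file *file, struct kiocb *kiocb)
    {
        if (!(file->f_mode & FMODE_BUF_RASYNC))
            return false;               /* must punt to io-wq */

        kiocb->ki_flags |= IOCB_WAITQ;  /* arm wait-queue based retry */
        return true;
    }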
2020-06-21  btrfs: flag files as supporting buffered async reads  (Jens Axboe)

btrfs uses generic_file_read_iter(), which already supports this.

Acked-by: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  xfs: flag files as supporting buffered async reads  (Jens Axboe)

XFS uses generic_file_read_iter(), which already supports this.

Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  block: flag block devices as supporting IOCB_WAITQ  (Jens Axboe)

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  io_uring: re-issue block requests that failed because of resources  (Jens Axboe)

Mark the plug with nowait == true, which will cause requests to avoid
blocking on request allocation. If they do fail with -EAGAIN, we catch
them and reissue them from a task_work-based handler.

Normally we can catch -EAGAIN directly, but the hard case is for split
requests. As an example, the application issues a 512KB request and the
block core splits it into 128KB chunks if that's the max size for the
device. The first request issues just fine, but we run into -EAGAIN for
some later splits of the same request. As the bio is split, we don't
get to see the -EAGAIN until one of the actual reads completes, and
hence we cannot handle it inline as part of submission.

This does potentially cause re-reads of parts of the range, as the
whole request is reissued. There's currently no better way to handle
this.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  io_uring: catch -EIO from buffered issue request failure  (Jens Axboe)

-EIO bubbles up like -EAGAIN if we fail to allocate a request at the
lower level. Play it safe and treat it like -EAGAIN in terms of sync
retry, to avoid passing back an errant -EIO.

Catch some of these early for block-based files, as non-mq devices
generally do not support NOWAIT. That saves us some overhead by not
first trying and then retrying from async context; we can go straight
to the async punt instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  io_uring: always plug for any number of IOs  (Jens Axboe)

Currently we only plug if we're doing more than two requests. We're
going to be relying on always having the plug there to pass down
information, so plug unconditionally.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  io_uring: separate reporting of ring pages from registered pages  (Bijan Mottahedeh)

Ring pages are not pinned, so it is more appropriate to report them as
locked.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  io_uring: report pinned memory usage  (Bijan Mottahedeh)

Report pinned memory usage always, regardless of whether the locked
memory limit is enforced.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  io_uring: rename ctx->account_mem field  (Bijan Mottahedeh)

Rename account_mem to limit_mem to clarify its purpose.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  io_uring: add wrappers for memory accounting  (Bijan Mottahedeh)

Facilitate separation of locked memory usage reporting vs. limiting for
upcoming patches. No functional changes.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[axboe: kill unnecessary () around return in io_account_mem()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  io_uring: use EPOLLEXCLUSIVE flag to aoid thundering herd type behavior  (Jiufei Xue)

Applications can pass this flag in to avoid the accept thundering herd
problem.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21  io_uring: change the poll type to be 32-bits  (Jiufei Xue)

Poll events should be 32 bits to cover EPOLLEXCLUSIVE. Explicitly
word-swap the poll32_events for big endian to make sure the ABI is not
changed. We call this feature IORING_FEAT_POLL_32BITS; applications
that want to use EPOLLEXCLUSIVE should check the feature bit first.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
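A userspace sketch of the feature-bit check (liburing-style; listen_fd
and error handling are omitted, so treat this as an assumption-laden
outline rather than canonical usage):

    struct io_uring_params p = { 0 };
    struct io_uring ring;

    io_uring_queue_init_params(64, &ring, &p);
    if (p.features & IORING_FEAT_POLL_32BITS) {
        /* EPOLLEXCLUSIVE: only one waiter wakes per accept event */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_poll_add(sqe, listen_fd, EPOLLIN | EPOLLEXCLUSIVE);
        io_uring_submit(&ring);
    }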
2020-06-20  Merge tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds)

Pull tracing fixes from Steven Rostedt:

 - Have recordmcount work with > 64K sections (to support LTO)
 - kprobe RCU fixes
 - Correct a kprobe critical section with missing mutex
 - Remove redundant arch_disarm_kprobe() call
 - Fix lockup when kretprobe triggers within kprobe_flush_task()
 - Fix memory leak in fetch_op_data operations
 - Fix sleep in atomic in ftrace trace array sample code
 - Free up memory on failure in sample trace array code
 - Fix incorrect reporting of function_graph fields in format file
 - Fix quote within quote parsing in bootconfig
 - Fix return value of bootconfig tool
 - Add testcases for bootconfig tool
 - Fix maybe uninitialized warning in ftrace pid file code
 - Remove unused variable in tracing_iter_reset()
 - Fix some typos

* tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix maybe-uninitialized compiler warning
  tools/bootconfig: Add testcase for show-command and quotes test
  tools/bootconfig: Fix to return 0 if succeeded to show the bootconfig
  tools/bootconfig: Fix to use correct quotes for value
  proc/bootconfig: Fix to use correct quotes for value
  tracing: Remove unused event variable in tracing_iter_reset
  tracing/probe: Fix memleak in fetch_op_data operations
  trace: Fix typo in allocate_ftrace_ops()'s comment
  tracing: Make ftrace packed events have align of 1
  sample-trace-array: Remove trace_array 'sample-instance'
  sample-trace-array: Fix sleeping function called from invalid context
  kretprobe: Prevent triggering kretprobe from within kprobe_flush_task
  kprobes: Remove redundant arch_disarm_kprobe() call
  kprobes: Fix to protect kick_kprobe_optimizer() by kprobe_mutex
  kprobes: Use non RCU traversal APIs on kprobe_tables if possible
  kprobes: Suppress the suspicious RCU warning on kprobes
  recordmcount: support >64k sections
2020-06-20  afs: Fix hang on rmmod due to outstanding timer  (David Howells)

The fileserver probe timer, net->fs_probe_timer, isn't cancelled when
the kafs module is being removed, and so the count it holds on
net->servers_outstanding doesn't get dropped. This causes rmmod to wait
forever. The hung process shows a stack like:

    afs_purge_servers+0x1b5/0x23c [kafs]
    afs_net_exit+0x44/0x6e [kafs]
    ops_exit_list+0x72/0x93
    unregister_pernet_operations+0x14c/0x1ba
    unregister_pernet_subsys+0x1d/0x2a
    afs_exit+0x29/0x6f [kafs]
    __do_sys_delete_module.isra.0+0x1a2/0x24b
    do_syscall_64+0x51/0x95
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by:

 (1) Attempting to cancel the probe timer and, if successful, dropping
     the count that the timer was holding.

 (2) Making the timer function just drop the count and not schedule the
     prober if the afs portion of the net namespace is being destroyed.

Also, whilst we're at it, make the following changes:

 (3) Initialise net->servers_outstanding to 1 and decrement it before
     waiting on it so that it doesn't generate wake-up events by being
     decremented to 0 until we're cleaning up.

 (4) Switch the atomic_dec() on ->servers_outstanding for ->fs_timer in
     afs_purge_servers() to use the helper function for that.

Fixes: f6cbb368bcb0 ("afs: Actively poll fileservers to maintain NAT or firewall openings")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
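The core of point (1), sketched (the decrement helper name follows the
afs naming convention but is an assumption here):

    /* del_timer() returns true only if the timer was still pending,
     * i.e. the timer function will not run and drop its count itself. */
    if (del_timer(&net->fs_probe_timer))
        afs_dec_servers_outstanding(net);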
2020-06-20  afs: Fix afs_do_lookup() to call correct fetch-status op variant  (David Howells)

Fix afs_do_lookup()'s fallback case for when FS.InlineBulkStatus isn't
supported by the server. In the fallback, it calls FS.FetchStatus for
the specific vnode it's meant to be looking up. Commit b6489a49f7b7
broke this by renaming one of the two identically-named
afs_fetch_status_operation descriptors to something else so that one of
them could be made non-static. The site that used the renamed one,
however, wasn't renamed and didn't produce any warning because the
other was declared in a header.

Fix this by making afs_do_lookup() use the renamed variant.

Note that there are two variants of the success method because one is
called from ->lookup(), where we may or may not have an inode but can't
call iget until after we've talked to the server - whereas the other is
called from within iget, where we have an inode, but it may or may not
be initialised.

The latter variant expects there to be an inode, but because it's being
called from the former case, there might not be one - resulting in an
oops like the following:

    BUG: kernel NULL pointer dereference, address: 00000000000000b0
    ...
    RIP: 0010:afs_fetch_status_success+0x27/0x7e
    ...
    Call Trace:
     afs_wait_for_operation+0xda/0x234
     afs_do_lookup+0x2fe/0x3c1
     afs_lookup+0x3c5/0x4bd
     __lookup_slow+0xcd/0x10f
     walk_component+0xa2/0x10c
     path_lookupat.isra.0+0x80/0x110
     filename_lookup+0x81/0x104
     vfs_statx+0x76/0x109
     __do_sys_newlstat+0x39/0x6b
     do_syscall_64+0x4c/0x78
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: b6489a49f7b7 ("afs: Fix silly rename")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-19  Merge tag 'io_uring-5.8-2020-06-19' of git://git.kernel.dk/linux-block  (Linus Torvalds)

Pull io_uring fixes from Jens Axboe:

 - Catch a case where io_sq_thread() didn't do proper mm acquire
 - Ensure poll completions are reaped on shutdown
 - Async cancelation and run fixes (Pavel)
 - io-poll race fixes (Xiaoguang)
 - Request cleanup race fix (Xiaoguang)

* tag 'io_uring-5.8-2020-06-19' of git://git.kernel.dk/linux-block:
  io_uring: fix possible race condition against REQ_F_NEED_CLEANUP
  io_uring: reap poll completions while waiting for refs to drop on exit
  io_uring: acquire 'mm' for task_work for SQPOLL
  io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed
  io_uring: don't fail links for EAGAIN error in IOPOLL mode
  io_uring: cancel by ->task not pid
  io_uring: lazy get task
  io_uring: batch cancel in io_uring_cancel_files()
  io_uring: cancel all task's requests on exit
  io-wq: add an option to cancel all matched reqs
  io-wq: reorder cancellation pending -> running
  io_uring: fix lazy work init
2020-06-19  Merge tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block  (Linus Torvalds)

Pull block fixes from Jens Axboe:

 - Use import_uuid() where appropriate (Andy)
 - bcache fixes (Coly, Mauricio, Zhiqiang)
 - blktrace sparse warnings fix (Jan)
 - blktrace concurrent setup fix (Luis)
 - blkdev_get use-after-free fix (Jason)
 - Ensure all blk-mq maps are updated (Weiping)
 - Loop invalidate bdev fix (Zheng)

* tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block:
  block: make function 'kill_bdev' static
  loop: replace kill_bdev with invalidate_bdev
  partitions/ldm: Replace uuid_copy() with import_uuid() where it makes sense
  block: update hctx map when use multiple maps
  blktrace: Avoid sparse warnings when assigning q->blk_trace
  blktrace: break out of blktrace setup on concurrent calls
  block: Fix use-after-free in blkdev_get()
  trace/events/block.h: drop kernel-doc for dropped function parameter
  blk-mq: Remove redundant 'return' statement
  bcache: pr_info() format clean up in bcache_device_init()
  bcache: use delayed kworker fo asynchronous devices registration
  bcache: check and adjust logical block size for backing devices
  bcache: fix potential deadlock problem in btree_gc_coalesce
2020-06-18  f2fs: use kfree() to free variables allocated by match_strdup()  (Wang Xiaojun)

Use kfree() instead of kvfree() to free variables allocated by
match_strdup(), because the memory is allocated with kmalloc() inside
match_strdup().

Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
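The pairing rule in a nutshell (a sketch; args[] comes from the usual
match_token() option-parsing pattern):

    /* match_strdup() allocates with kmalloc(), so the buffer must be
     * freed with kfree(), not kvfree() - it is never vmalloc-backed. */
    char *name = match_strdup(&args[0]);
    if (!name)
        return -ENOMEM;
    /* ... use name ... */
    kfree(name);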
2020-06-18  Merge branch 'hch' (maccess patches from Christoph Hellwig)  (Linus Torvalds)

Merge non-faulting memory access cleanups from Christoph Hellwig:
 "Andrew and I decided to drop the patches implementing your suggested
  rename of the probe_kernel_* and probe_user_* helpers from -mm as
  there were way too many conflicts. After -rc1 might be a good time
  for this as all the conflicts are resolved now"

This also adds a type safety checking patch on top of the renaming
series to make the subtle behavioral difference between 'get_user()'
and 'get_kernel_nofault()' less potentially dangerous and surprising.

* emailed patches from Christoph Hellwig <hch@lst.de>:
  maccess: make get_kernel_nofault() check for minimal type compatibility
  maccess: rename probe_kernel_address to get_kernel_nofault
  maccess: rename probe_user_{read,write} to copy_{from,to}_user_nofault
  maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault
2020-06-18  f2fs: get the right gc victim section when section has several segments  (Jack Qiu)

Assume each section has 4 segments:

    .___________________________.
    |_Segment0_|_..._|_Segment3_|
    .                           .
    .__________.
    |_section0_|

Segments 0~2 have 0 valid blocks and segment 3 has 512 valid blocks. GC
will fail if we want to gc section0 in this scenario, because none of
the 4 segments in section0 is dirty. So we should use a dirty section
bitmap instead of the dirty segment bitmap to get the right victim
section.

Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-18  f2fs: fix a race condition between f2fs_write_end_io and f2fs_del_fsync_node_entry  (Wuyun Zhao)

Under some conditions, __write_node_page() can submit a page which is
not f2fs_in_warm_node_list() and so will not call
f2fs_add_fsync_node_entry(). f2fs_gc() continues to run and invokes
f2fs_iget -> do_read_inode to read the same node page and set the cold
node flag, which makes f2fs_in_warm_node_list() become true; that
triggers an f2fs_bug_on() in f2fs_del_fsync_node_entry() when
f2fs_write_end_io() is called:

    - f2fs_write_end_io
                            - f2fs_iget
                              - do_read_inode
                                - set_cold_node      (recovers cold node flag)
      - f2fs_in_warm_node_list
        - is_cold_node       (node is cold, so assume it was added to
                              fsync_node_list during writepages())
      - f2fs_del_fsync_node_entry
        - f2fs_bug_on()      (node page is not in fsync_node_list)

    [ 34.966133] Call trace:
    [ 34.969902] f2fs_del_fsync_node_entry+0x100/0x108
    [ 34.976071] f2fs_write_end_io+0x1e0/0x288
    [ 34.981539] bio_endio+0x248/0x270
    [ 34.986289] blk_update_request+0x2b0/0x4d8
    [ 34.991841] scsi_end_request+0x40/0x440
    [ 34.997126] scsi_io_completion+0xa4/0x748
    [ 35.002593] scsi_finish_command+0xdc/0x110
    [ 35.008143] scsi_softirq_done+0x118/0x150
    [ 35.013610] blk_done_softirq+0x8c/0xe8
    [ 35.018811] __do_softirq+0x2e8/0x578
    [ 35.023828] irq_exit+0xfc/0x120
    [ 35.028398] handle_IPI+0x1d8/0x330
    [ 35.033233] gic_handle_irq+0x110/0x1d4
    [ 35.038433] el1_irq+0xb4/0x130
    [ 35.042917] kmem_cache_alloc+0x3f0/0x418
    [ 35.048288] radix_tree_node_alloc+0x50/0xf8
    [ 35.053933] __radix_tree_create+0xf8/0x188
    [ 35.059484] __radix_tree_insert+0x3c/0x128
    [ 35.065035] add_gc_inode+0x90/0x118
    [ 35.069967] f2fs_gc+0x1b80/0x2d70
    [ 35.074718] f2fs_disable_checkpoint+0x94/0x1d0
    [ 35.080621] f2fs_fill_super+0x10c4/0x1b88
    [ 35.086088] mount_bdev+0x194/0x1e0
    [ 35.090923] f2fs_mount+0x40/0x50
    [ 35.095589] mount_fs+0xb4/0x190
    [ 35.100159] vfs_kern_mount+0x80/0x1d8
    [ 35.105260] do_mount+0x478/0xf18
    [ 35.109926] ksys_mount+0x90/0xd0
    [ 35.114592] __arm64_sys_mount+0x24/0x38

Signed-off-by: Wuyun Zhao <zhaowuyun@wingtech.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-18  f2fs: remove useless truncate in f2fs_collapse_range()  (Wei Fang)

Since offset < new_size, there is no need to do truncate_pagecache()
again with new_size.

Signed-off-by: Wei Fang <fangwei1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-18  f2fs: use kfree() instead of kvfree() to free superblock data  (Denis Efremov)

Use kfree() instead of kvfree() to free super in read_raw_super_block()
because the memory is allocated with kzalloc() in that function.
Likewise, use kfree() instead of kvfree() to free sbi and raw_super in
f2fs_fill_super() and f2fs_put_super() because that memory is also
allocated with kzalloc().

Signed-off-by: Denis Efremov <efremov@linux.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-18  f2fs: avoid checkpatch error  (Jaegeuk Kim)

Fix the following checkpatch error:

    ERROR:INITIALISED_STATIC: do not initialise statics to NULL

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-06-18  block: make function 'kill_bdev' static  (Zheng Bin)

kill_bdev() does not have any external users, so make it static.

Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-18  io_uring: fix possible race condition against REQ_F_NEED_CLEANUP  (Xiaoguang Wang)

In io_read() or io_write(), when the io request is submitted
successfully, it goes through the below sequence:

    kfree(iovec);
    req->flags &= ~REQ_F_NEED_CLEANUP;
    return ret;

But clearing REQ_F_NEED_CLEANUP might be unsafe: the io request may
already have been completed, and then io_complete_rw_iopoll() and
io_complete_rw() will be called, both of which also modify req->flags
if needed. This causes a race condition, with concurrent non-atomic
modification of req->flags.

To eliminate this race, when the io request is submitted successfully
in io_read() or io_write(), we don't remove the REQ_F_NEED_CLEANUP
flag. If REQ_F_NEED_CLEANUP is set, we leave the iovec cleanup work to
__io_req_aux_free() correspondingly.

Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17  io_uring: reap poll completions while waiting for refs to drop on exit  (Jens Axboe)

If we're doing polled IO and end up having requests being submitted
async, then completions can come in while we're waiting for refs to
drop. We need to reap these manually, as nobody else will be looking
for them.

Break the wait into 1/20th-of-a-second waits, and check for completed
poll requests if we time out. Otherwise we can have completed poll
requests sitting in ctx->poll_list, which need us to reap them while
we're just waiting for them.

Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17  io_uring: acquire 'mm' for task_work for SQPOLL  (Jens Axboe)

If we're unlucky with timing, we could be running task_work after
having dropped the memory context in the sq thread. Since dropping the
context requires a runnable task state, we cannot reliably drop it as
part of our check-for-work loop in io_sq_thread(). Instead, abstract
out the mm acquire for the sq thread into a helper, and call it from
the async task work handler.

Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17  io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed  (Xiaoguang Wang)

In io_complete_rw_iopoll(), the stores to an io_kiocb's result and
iopoll_completed are two independent store operations. To ensure that
once iopoll_completed is true, req->result is also visible to the cpu
executing io_do_iopoll(), a proper memory barrier should be used.

And in io_do_iopoll(), we check whether req->result is -EAGAIN; if it
is, we need to issue the io request again using io-wq. In order to
issue just a single smp_rmb() on the completion side, move the
re-submit work to io_iopoll_complete().

Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
[axboe: don't set ->iopoll_completed for -EAGAIN retry]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
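The barrier pairing, sketched (simplified from the description above;
the resubmit helper is a hypothetical stand-in):

    /* completion side: io_complete_rw_iopoll() */
    req->result = res;
    smp_wmb();                  /* order result before the flag */
    WRITE_ONCE(req->iopoll_completed, 1);

    /* poll side: io_do_iopoll() */
    if (READ_ONCE(req->iopoll_completed)) {
        smp_rmb();              /* pairs with the smp_wmb() above */
        if (req->result == -EAGAIN)
            resubmit_via_io_wq(req);
    }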
2020-06-17  io_uring: don't fail links for EAGAIN error in IOPOLL mode  (Xiaoguang Wang)

In IOPOLL mode, for an EAGAIN error we'll try to submit the io request
again using io-wq, so don't fail the rest of the links if this io
request has links.

Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17  maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault  (Christoph Hellwig)

Better describe what these functions do.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-17  nfsd: apply umask on fs without ACL support  (J. Bruce Fields)

The server is failing to apply the umask when creating new objects on
filesystems without ACL support.

To reproduce this, you need to use NFSv4.2 and a client and server
recent enough to support umask, and you need to export a filesystem
that lacks ACL support (for example, ext4 with the "noacl" mount
option).

Filesystems with ACL support are expected to take care of the umask
themselves (usually by calling posix_acl_create). For filesystems
without ACL support, this is up to the caller of vfs_create(),
vfs_mknod(), or vfs_mkdir().

Reported-by: Elliott Mitchell <ehem+debian@m5p.com>
Reported-by: Salvatore Bonaccorso <carnil@debian.org>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Fixes: 47057abde515 ("nfsd: add support for the umask attribute")
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
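The gist of the caller-side fix, sketched (close to but not verbatim
the nfsd change; dirp is the parent directory inode, iap the attributes
for the new object):

    /* Without ACL support there is no posix_acl_create() to apply the
     * umask, so mask the mode before vfs_create()/mkdir()/mknod(). */
    if (!IS_POSIXACL(dirp))
        iap->ia_mode &= ~current_umask();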
2020-06-16  proc/bootconfig: Fix to use correct quotes for value  (Masami Hiramatsu)

Fix /proc/bootconfig to select double or single quotes correctly
according to the value. If a bootconfig value includes a double-quote
character, we must use single quotes to quote that value.

This modifies the if() condition and blocks to avoid the
double-quote-in-value check in 2 places. Anyway, since
xbc_array_for_each_value() can handle an array which has a single node
correctly,

    if (vnode && xbc_node_is_array(vnode)) {
        xbc_array_for_each_value(vnode)    /* vnode->next != NULL */
        ...
    } else {
        snprintf(val);    /* val is an empty string if !vnode */
    }

is equivalent to

    if (vnode) {
        xbc_array_for_each_value(vnode)    /* vnode->next can be NULL */
        ...
    } else {
        snprintf("");    /* value is always empty */
    }

Link: http://lkml.kernel.org/r/159230244786.65555.3763894451251622488.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes: c1a3c36017d4 ("proc: bootconfig: Add /proc/bootconfig to show boot config list")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-16  Merge tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs  (Linus Torvalds)

Pull AFS fixes from David Howells:
 "I've managed to get xfstests kind of working with afs. Here are a set
  of patches that fix most of the bugs found.

  There are a number of primary issues:

   - Incorrect handling of mtime and non-handling of ctime. It might be
     argued that the latter isn't a bug since the AFS protocol doesn't
     support ctime, but I should probably still update it locally.

   - Shared-write mmap, truncate and writeback bugs. This includes not
     changing i_size under the callback lock, overwriting local i_size
     with the reply from the server after a partial writeback, and not
     limiting the writeback from an mmapped page to EOF.

   - Checks for an abort code indicating that the primary vnode in an
     operation was deleted by a third party are done in the wrong
     place.

   - Silly rename bugs. This includes an incomplete conversion to the
     new operation handling, duplicate nlink handling, nlink changing
     not being done inside the callback lock and insufficient handling
     of third-party conflicting directory changes.

  And some secondary ones:

   - The UAEOVERFLOW abort code should map to EOVERFLOW not EREMOTEIO.

   - Remove a couple of unused or incompletely used bits.

   - Remove a couple of redundant success checks.

  These seem to fix all the data-corruption bugs found by ./check -afs
  -g quick along with the obvious silly rename bugs and time bugs.

  There are still some test failures, but they seem to fall into two
  classes: firstly, the authentication/security model is different to
  the standard UNIX model and permission is arbitrated by the server
  and cached locally; and secondly, there are a number of features that
  AFS does not support (such as mknod). But in these cases, the tests
  themselves need to be adapted or skipped.

  Using the in-kernel afs client with xfstests also found a bug in the
  AuriStor AFS server that has been fixed for a future release"

* tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix silly rename
  afs: afs_vnode_commit_status() doesn't need to check the RPC error
  afs: Fix use of afs_check_for_remote_deletion()
  afs: Remove afs_operation::abort_code
  afs: Fix yfs_fs_fetch_status() to honour vnode selector
  afs: Remove yfs_fs_fetch_file_status() as it's not used
  afs: Fix the mapping of the UAEOVERFLOW abort code
  afs: Fix truncation issues and mmap writeback size
  afs: Concoct ctimes
  afs: Fix EOF corruption
  afs: afs_write_end() should change i_size under the right lock
  afs: Fix non-setting of mtime when writing into mmap
2020-06-16  Merge tag 'flex-array-conversions-5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux  (Linus Torvalds)

Pull flexible-array member conversions from Gustavo A. R. Silva:
 "Replace zero-length arrays with flexible-array members. Notice that
  all of these patches have been baking in linux-next for two
  development cycles now.

  There is a regular need in the kernel to provide a way to declare
  having a dynamically sized set of trailing elements in a structure.
  Kernel code should always use "flexible array members"[1] for these
  cases. The older style of one-element or zero-length arrays should no
  longer be used[2].

  C99 introduced "flexible array members", which lacks a numeric size
  for the array declaration entirely:

    struct something {
        size_t count;
        struct foo items[];
    };

  This is the way the kernel expects dynamically sized trailing
  elements to be declared. It allows the compiler to generate errors
  when the flexible array does not occur last in the structure, which
  helps to prevent some kind of undefined behavior[3] bugs from being
  inadvertently introduced to the codebase. It also allows the compiler
  to correctly analyze array sizes (via sizeof(),
  CONFIG_FORTIFY_SOURCE, and CONFIG_UBSAN_BOUNDS).

  For instance, there is no mechanism that warns us that the following
  application of the sizeof() operator to a zero-length array always
  results in zero:

    struct something {
        size_t count;
        struct foo items[0];
    };

    struct something *instance;

    instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
    instance->count = count;

    size = sizeof(instance->items) * instance->count;
    memcpy(instance->items, source, size);

  At the last line of code above, size turns out to be zero, when one
  might have thought it represents the total size in bytes of the
  dynamic memory recently allocated for the trailing array items. Here
  are a couple of examples of this issue[4][5].

  Instead, flexible array members have incomplete type, and so the
  sizeof() operator may not be applied[6], so any misuse of such
  operators will be immediately noticed at build time.

  The cleanest and least error-prone way to implement this is through
  the use of a flexible array member:

    struct something {
        size_t count;
        struct foo items[];
    };

    struct something *instance;

    instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
    instance->count = count;

    size = sizeof(instance->items[0]) * instance->count;
    memcpy(instance->items, source, size);

  instead"

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
[4] commit f2cd32a443da ("rndis_wlan: Remove logically dead code")
[5] commit ab91c2a89f86 ("tpm: eventlog: Replace zero-length array with flexible-array member")
[6] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html

* tag 'flex-array-conversions-5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (41 commits)
  w1: Replace zero-length array with flexible-array
  tracing/probe: Replace zero-length array with flexible-array
  soc: ti: Replace zero-length array with flexible-array
  tifm: Replace zero-length array with flexible-array
  dmaengine: tegra-apb: Replace zero-length array with flexible-array
  stm class: Replace zero-length array with flexible-array
  Squashfs: Replace zero-length array with flexible-array
  ASoC: SOF: Replace zero-length array with flexible-array
  ima: Replace zero-length array with flexible-array
  sctp: Replace zero-length array with flexible-array
  phy: samsung: Replace zero-length array with flexible-array
  RxRPC: Replace zero-length array with flexible-array
  rapidio: Replace zero-length array with flexible-array
  media: pwc: Replace zero-length array with flexible-array
  firmware: pcdp: Replace zero-length array with flexible-array
  oprofile: Replace zero-length array with flexible-array
  block: Replace zero-length array with flexible-array
  tools/testing/nvdimm: Replace zero-length array with flexible-array
  libata: Replace zero-length array with flexible-array
  kprobes: Replace zero-length array with flexible-array
  ...
2020-06-17  close_range: add CLOSE_RANGE_UNSHARE  (Christian Brauner)

One of the use-cases of close_range() is to drop file descriptors just
before execve(). This would usually be expressed as the sequence:

    unshare(CLONE_FILES);
    close_range(3, ~0U);

As pointed out by Linus, it might be desirable to have this be a part
of close_range() itself, under a new flag CLOSE_RANGE_UNSHARE.

This expands {dup,unshare}_fd() to take a max_fds argument that
indicates the maximum number of file descriptors to copy from the old
struct files. When the user requests that all file descriptors be
closed via close_range(min, max), we can cap via unshare_fd(min) and
hence don't need to do any of the heavy fput() work for everything
above min.

The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we
do in fact currently share our file descriptor table, we create a new
private copy. We then close all fds in the requested range and finally,
after we're done, install the new fd table.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
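A userspace sketch of the combined call (assumptions: the raw syscall
is invoked via syscall(2) on a kernel and libc that expose it, and the
flag value matches the uapi header; error handling is minimal):

    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef CLOSE_RANGE_UNSHARE
    #define CLOSE_RANGE_UNSHARE (1U << 1)    /* per <linux/close_range.h> */
    #endif

    static int drop_fds_then_exec(char *const argv[], char *const envp[])
    {
        /* unshare the fd table and close fds 3..~0U in one call */
        if (syscall(__NR_close_range, 3, ~0U, CLOSE_RANGE_UNSHARE) < 0)
            return -1;
        return execve(argv[0], argv, envp);
    }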