linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2019-11-14	ext4: code cleanup for get_next_id	Chengguang Xu
	Now the checks in ext4_get_next_id() and dquot_get_next_id() are almost the same, so just call dquot_get_next_id() instead of ext4_get_next_id(). Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Link: https://lore.kernel.org/r/20191006103028.31299-1-cgxu519@mykernel.net Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14	ext4: fix leak of quota reservations	Jan Kara
	Commit 8fcc3a580651 ("ext4: rework reserved cluster accounting when invalidating pages") moved freeing of delayed allocation reservations from dirty page invalidation time to time when we evict corresponding status extent from extent status tree. For inodes which don't have any blocks allocated this may actually happen only in ext4_clear_blocks() which is after we've dropped references to quota structures from the inode. Thus reservation of quota leaked. Fix the problem by clearing quota information from the inode only after evicting extent status tree in ext4_clear_inode(). Link: https://lore.kernel.org/r/20191108115420.GI20863@quack2.suse.cz Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Fixes: 8fcc3a580651 ("ext4: rework reserved cluster accounting when invalidating pages") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14	ext4: remove unused variable warning in parse_options()	Olof Johansson
	Commit c33fbe8f673c5 ("ext4: Enable blocksize < pagesize for dioread_nolock") removed the only user of 'sbi' outside of the ifdef, so it caused a new warning: fs/ext4/super.c:2068:23: warning: unused variable 'sbi' [-Wunused-variable] Fixes: c33fbe8f673c5 ("ext4: Enable blocksize < pagesize for dioread_nolock") Signed-off-by: Olof Johansson <olof@lixom.net> Link: https://lore.kernel.org/r/20191111022523.34256-1-olof@lixom.net Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
2019-11-14	ext4: Enable encryption for subpage-sized blocks	Chandan Rajendra
	Now that we have the code to support encryption for subpage-sized blocks, this commit removes the conditional check in filesystem mount code. The commit also changes the support statement in Documentation/filesystems/fscrypt.rst to reflect the fact that encryption on filesystems with blocksize less than page size now works. [EB: Tested with 'gce-xfstests -c ext4/encrypt_1k -g auto', using the new "encrypt_1k" config I created. All tests pass except for those that already fail or are excluded with the encrypt or 1k configs, and 2 tests that try to create 1023-byte symlinks which fails since encrypted symlinks are limited to blocksize-3 bytes. Also ran the dedicated encryption tests using 'kvm-xfstests -c ext4/1k -g encrypt'; all pass, including the on-disk ciphertext verification tests.] Signed-off-by: Chandan Rajendra <chandan@linux.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191023033312.361355-3-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14	fs/buffer.c: support fscrypt in block_read_full_page()	Eric Biggers
	After each filesystem block (as represented by a buffer_head) has been read from disk by block_read_full_page(), decrypt it if needed. The decryption is done on the fscrypt_read_workqueue. This is the final change needed to support ext4 encryption with blocksize != PAGE_SIZE, and it's a fairly small change now that CONFIG_FS_ENCRYPTION is a bool and fs/crypto/ exposes functions to decrypt individual blocks and to enqueue work on the fscrypt workqueue. Don't try to add fs-verity support yet, as the fs/verity/ support layer isn't ready for sub-page blocks yet. Just add fscrypt support for now. Almost all the new code is compiled away when CONFIG_FS_ENCRYPTION=n. Cc: Chandan Rajendra <chandan@linux.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191023033312.361355-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14	io_uring: make POLL_ADD/POLL_REMOVE scale better	Jens Axboe
	One of the obvious use cases for these commands is networking, where it's not uncommon to have tons of sockets open and polled for. The current implementation uses a list for insertion and lookup, which works fine for file based use cases where the count is usually low, it breaks down somewhat for higher number of files / sockets. A test case with 30k sockets being polled for and cancelled takes: real 0m6.968s user 0m0.002s sys 0m6.936s with the patch it takes: real 0m0.233s user 0m0.010s sys 0m0.176s If you go to 50k sockets, it gets even more abysmal with the current code: real 0m40.602s user 0m0.010s sys 0m40.555s with the patch it takes: real 0m0.398s user 0m0.000s sys 0m0.341s Change is pretty straight forward, just replace the cancel_list with a red/black tree instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14	ceph: increment/decrement dio counter on async requests	Jeff Layton
	Ceph can in some cases issue an async DIO request, in which case we can end up calling ceph_end_io_direct before the I/O is actually complete. That may allow buffered operations to proceed while DIO requests are still in flight. Fix this by incrementing the i_dio_count when issuing an async DIO request, and decrement it when tearing down the aio_req. Fixes: 321fe13c9398 ("ceph: add buffered/direct exclusionary locking for reads and writes") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-14	ceph: take the inode lock before acquiring cap refs	Jeff Layton
	Most of the time, we (or the vfs layer) takes the inode_lock and then acquires caps, but ceph_read_iter does the opposite, and that can lead to a deadlock. When there are multiple clients treading over the same data, we can end up in a situation where a reader takes caps and then tries to acquire the inode_lock. Another task holds the inode_lock and issues a request to the MDS which needs to revoke the caps, but that can't happen until the inode_lock is unwedged. Fix this by having ceph_read_iter take the inode_lock earlier, before attempting to acquire caps. Fixes: 321fe13c9398 ("ceph: add buffered/direct exclusionary locking for reads and writes") Link: https://tracker.ceph.com/issues/36348 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-14	io-wq: remove now redundant struct io_wq_nulls_list	Jens Axboe
	Since we don't iterate these lists anymore after commit: e61df66c69b1 ("io-wq: ensure free/busy list browsing see all items") we don't need to retain the nulls value we use for them. That means it's pretty pointless to wrap the hlist_nulls_head in a structure, so get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14	block: move clearing bd_invalidated into check_disk_size_change	Christoph Hellwig
	Both callers of check_disk_size_change clear bd_invalidate directly after the call, so move the clearing into check_disk_size_change itself. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14	block: remove (__)blkdev_reread_part as an exported API	Christoph Hellwig
	In general drivers should never mess with partition tables directly. Unfortunately s390 and loop do for somewhat historic reasons, but they can use bdev_disk_changed directly instead when we export it as they satisfy the sanity checks we have in __blkdev_reread_part. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefan Haberland <sth@linux.ibm.com> [dasd] Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14	block: fix bdev_disk_changed for non-partitioned devices	Christoph Hellwig
	We still have to set the capacity to 0 if invalidating or call revalidate_disk if not even if the disk has no partitions. Fix that by merging rescan_partitions into bdev_disk_changed and just stubbing out blk_add_partitions and blk_drop_partitions for non-partitioned devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14	block: move rescan_partitions to fs/block_dev.c	Christoph Hellwig
	Large parts of rescan_partitions aren't about partitions, and moving it to block_dev.c will allow for some further cleanups by merging it into its only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14	block: merge invalidate_partitions into rescan_partitions	Christoph Hellwig
	A lot of the logic in invalidate_partitions and rescan_partitions is shared. Merge the two functions to simplify things. There is a small behavior change in that we now send the kevent change notice also if we were not invalidating but no partitions were found, which seems like the right thing to do. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13	io_uring: Fix getting file for non-fd opcodes	Pavel Begunkov
	For timeout requests and bunch of others io_uring tries to grab a file with specified fd, which is usually stdin/fd=0. Update io_op_needs_file() Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13	io_uring: introduce req_need_defer()	Bob Liu
	Makes the code easier to read. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13	io_uring: clean up io_uring_cancel_files()	Bob Liu
	We don't use the return value anymore, drop it. Also drop the unecessary double cancel_req value check. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13	io-wq: ensure free/busy list browsing see all items	Jens Axboe
	We have two lists for workers in io-wq, a busy and a free list. For certain operations we want to browse all workers, and we currently do that by browsing the two separate lists. But since these lists are RCU protected, we can potentially miss workers if they move between the two lists while we're browsing them. Add a third list, all_list, that simply holds all workers. A worker is added to that list when it starts, and removed when it exits. This makes the worker iteration cleaner, too. Reported-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13	io_uring: ensure registered buffer import returns the IO length	Jens Axboe
	A test case was reported where two linked reads with registered buffers failed the second link always. This is because we set the expected value of a request in req->result, and if we don't get this result, then we fail the dependent links. For some reason the registered buffer import returned -ERROR/0, while the normal import returns -ERROR/length. This broke linked commands with registered buffers. Fix this by making io_import_fixed() correctly return the mapped length. Cc: stable@vger.kernel.org # v5.3 Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13	io_uring: Fix getting file for timeout	Pavel Begunkov
	For timeout requests io_uring tries to grab a file with specified fd, which is usually stdin/fd=0. Update io_op_needs_file() Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13	io-wq: ensure we have a stable view of ->cur_work for cancellations	Jens Axboe
	worker->cur_work is currently protected by the lock of the wqe that the worker belongs to. When we send a signal to a worker, we need a stable view of ->cur_work, so we need to hold that lock. But this doesn't work so well, since we have the opposite order potentially on queueing work. If POLL_ADD is used with a signalfd, then io_poll_wake() is called with the signal lock, and that sometimes needs to insert work items. Add a specific worker lock that protects the current work item. Then we can guarantee that the task we're sending a signal is currently processing the exact work we think it is. Reported-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13	f2fs: support STATX_ATTR_VERITY	Eric Biggers
	Set the STATX_ATTR_VERITY bit when the statx() system call is used on a verity file on f2fs. Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-13	ext4: support STATX_ATTR_VERITY	Eric Biggers
	Set the STATX_ATTR_VERITY bit when the statx() system call is used on a verity file on ext4. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-11-13	Merge tag 'for-5.4-rc7-tag' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A fix for an older bug that has started to show up during testing (because of an updated test for rename exchange). It's an in-memory corruption caused by local variable leaking out of the function scope" * tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix log context list corruption after rename exchange operation
2019-11-13	io_wq: add get/put_work handlers to io_wq_create()	Jens Axboe
	For cancellation, we need to ensure that the work item stays valid for as long as ->cur_work is valid. Right now we can't safely dereference the work item even under the wqe->lock, because while the ->cur_work pointer will remain valid, the work could be completing and be freed in parallel. Only invoke ->get/put_work() on items we know that the caller queued themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed when we're queueing a flush item, for instance. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13	io_uring: check for validity of ->rings in teardown	Jens Axboe
	Normally the rings are always valid, the exception is if we failed to allocate the rings at setup time. syzbot reports this: RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229 RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 8903 Comm: syz-executor410 Not tainted 5.4.0-rc7-next-20191113 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline] RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline] RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592 Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1 ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61 RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100 RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0 R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000 FS: 0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: io_cqring_overflow_flush+0x6b9/0xa90 fs/io_uring.c:673 io_ring_ctx_wait_and_kill+0x24f/0x7c0 fs/io_uring.c:4260 io_uring_create fs/io_uring.c:4600 [inline] io_uring_setup+0x1256/0x1cc0 fs/io_uring.c:4626 __do_sys_io_uring_setup fs/io_uring.c:4639 [inline] __se_sys_io_uring_setup fs/io_uring.c:4636 [inline] __x64_sys_io_uring_setup+0x54/0x80 fs/io_uring.c:4636 do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x441229 Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229 RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 Modules linked in: ---[ end trace b0f5b127a57f623f ]--- RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline] RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline] RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592 Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1 ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61 RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100 RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0 R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000 FS: 0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 which is exactly the case of failing to allocate the SQ/CQ rings, and then entering shutdown. Check if the rings are valid before trying to access them at shutdown time. Reported-by: syzbot+21147d79607d724bd6f3@syzkaller.appspotmail.com Fixes: 1d7bb1d50fb4 ("io_uring: add support for backlogged CQ ring") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12	block: rework zone reporting	Christoph Hellwig
	Avoid the need to allocate a potentially large array of struct blk_zone in the block layer by switching the ->report_zones method interface to a callback model. Now the caller simply supplies a callback that is executed on each reported zone, and private data for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12	io_uring: fix potential deadlock in io_poll_wake()	Jens Axboe
	We attempt to run the poll completion inline, but we're using trylock to do so. This avoids a deadlock since we're grabbing the locks in reverse order at this point, we already hold the poll wq lock and we're trying to grab the completion lock, while the normal rules are the reverse of that order. IO completion for a timeout link will need to grab the completion lock, but that's not safe from this context. Put the completion under the completion_lock in io_poll_wake(), and mark the request as entering the completion with the completion_lock already held. Fixes: 2665abfd757f ("io_uring: add support for linked SQE timeouts") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12	kernfs: use 64bit inos if ino_t is 64bit	Tejun Heo
	Each kernfs_node is identified with a 64bit ID. The low 32bit is exposed as ino and the high gen. While this already allows using inos as keys by looking up with wildcard generation number of 0, it's adding unnecessary complications for 64bit ino archs which can directly use kernfs_node IDs as inos to uniquely identify each cgroup instance. This patch exposes IDs directly as inos on 64bit ino archs. The conversion is mostly straight-forward. * 32bit ino archs behave the same as before. 64bit ino archs now use the whole 64bit ID as ino and the generation number is fixed at 1. * 64bit inos still use the same idr allocator which gurantees that the lower 32bits identify the current live instance uniquely and the high 32bits are incremented whenever the low bits wrap. As the upper 32bits are no longer used as gen and we don't wanna start ino allocation with 33rd bit set, the initial value for highbits allocation is changed to 0 on 64bit ino archs. * blktrace exposes two 32bit numbers - (INO,GEN) pair - to identify the issuing cgroup. Userland builds FILEID_INO32_GEN fids from these numbers to look up the cgroups. To remain compatible with the behavior, always output (LOW32,HIGH32) which will be constructed back to the original 64bit ID by __kernfs_fh_to_dentry(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12	kernfs: implement custom exportfs ops and fid type	Tejun Heo
	The current kernfs exportfs implementation uses the generic_fh_() helpers and FILEID_INO32_GEN[_PARENT] which limits ino to 32bits. Let's implement custom exportfs operations and fid type to remove the restriction. FILEID_KERNFS is a single u64 value whose content is kernfs_node->id. This is the only native fid type. * For backward compatibility with blk_log_action() path which exposes (ino,gen) pairs which userland assembles into FILEID_INO32_GEN keys, combine the generic keys into 64bit IDs in the same order. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12	kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()	Tejun Heo
	kernfs_find_and_get_node_by_ino() looks the kernfs_node matching the specified ino. On top of that, kernfs_get_node_by_id() and kernfs_fh_get_inode() implement full ID matching by testing the rest of ID. On surface, confusingly, the two are slightly different in that the latter uses 0 gen as wildcard while the former doesn't - does it mean that the latter can't uniquely identify inodes w/ 0 gen? In practice, this is a distinction without a difference because generation number starts at 1. There are no actual IDs with 0 gen, so it can always safely used as wildcard. Let's simplify the code by renaming kernfs_find_and_get_node_by_ino() to kernfs_find_and_get_node_by_id(), moving all lookup logics into it, and removing now unnecessary kernfs_get_node_by_id(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12	kernfs: convert kernfs_node->id from union kernfs_node_id to u64	Tejun Heo
	kernfs_node->id is currently a union kernfs_node_id which represents either a 32bit (ino, gen) pair or u64 value. I can't see much value in the usage of the union - all that's needed is a 64bit ID which the current code is already limited to. Using a union makes the code unnecessarily complicated and prevents using 64bit ino without adding practical benefits. This patch drops union kernfs_node_id and makes kernfs_node->id a u64. ino is stored in the lower 32bits and gen upper. Accessors - kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the ino and gen. This simplifies ID handling less cumbersome and will allow using 64bit inos on supported archs. This patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexei Starovoitov <ast@kernel.org>
2019-11-12	kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes	Tejun Heo
	kernfs node can be created in two separate steps - allocation and activation. This is used to make kernfs nodes visible only after the internal states attached to the node are fully initialized. kernfs_find_and_get_node_by_id() currently allows lookups of nodes which aren't activated yet and thus can expose nodes are which are still being prepped by kernfs users. Fix it by disallowing lookups of nodes which aren't activated yet. kernfs_find_and_get_node_by_ino() Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12	kernfs: use dumber locking for kernfs_find_and_get_node_by_ino()	Tejun Heo
	kernfs_find_and_get_node_by_ino() uses RCU protection. It's currently a bit buggy because it can look up a node which hasn't been activated yet and thus may end up exposing a node that the kernfs user is still prepping. While it can be fixed by pushing it further in the current direction, it's already complicated and isn't clear whether the complexity is justified. The main use of kernfs_find_and_get_node_by_ino() is for exportfs operations. They aren't super hot and all the follow-up operations (e.g. mapping to path) use normal locking anyway. Let's switch to a dumber locking scheme and protect the lookup with kernfs_idr_lock. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12	kernfs: fix ino wrap-around detection	Tejun Heo
	When the 32bit ino wraps around, kernfs increments the generation number to distinguish reused ino instances. The wrap-around detection tests whether the allocated ino is lower than what the cursor but the cursor is pointing to the next ino to allocate so the condition never triggers. Fix it by remembering the last ino and comparing against that. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: 4a3ef68acacf ("kernfs: implement i_generation") Cc: Namhyung Kim <namhyung@kernel.org> Cc: stable@vger.kernel.org # v4.14+
2019-11-12	io_uring: use correct "is IO worker" helper	Jens Axboe
	Since we switched to io-wq, the dependent link optimization for when to pass back work inline has been broken. Fix this by providing a suitable io-wq helper for io_uring to use to detect when to do this. Fixes: 561fb04a6a22 ("io_uring: replace workqueue usage with io-wq") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12	io_uring: make timeout sequence == 0 mean no sequence	Jens Axboe
	Currently we make sequence == 0 be the same as sequence == 1, but that's not super useful if the intent is really to have a timeout that's just a pure timeout. If the user passes in sqe->off == 0, then don't apply any sequence logic to the request, let it purely be driven by the timeout specified. Reported-by: 李通洲 <carter.li@eoitek.com> Reviewed-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11	io_uring: fix -ENOENT issue with linked timer with short timeout	Jens Axboe
	If you prep a read (for example) that needs to get punted to async context with a timer, if the timeout is sufficiently short, the timer request will get completed with -ENOENT as it could not find the read. The issue is that we prep and start the timer before we start the read. Hence the timer can trigger before the read is even started, and the end result is then that the timer completes with -ENOENT, while the read starts instead of being cancelled by the timer. Fix this by splitting the linked timer into two parts: 1) Prep and validate the linked timer 2) Start timer The read is then started between steps 1 and 2, so we know that the timer will always have a consistent view of the read request state. Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11	io_uring: don't do flush cancel under inflight_lock	Jens Axboe
	We can't safely cancel under the inflight lock. If the work hasn't been started yet, then io_wq_cancel_work() simply marks the work as cancelled and invokes the work handler. But if the work completion needs to grab the inflight lock because it's grabbing user files, then we'll deadlock trying to finish the work as we already hold that lock. Instead grab a reference to the request, if it isn't already zero. If it's zero, then we know it's going through completion anyway, and we can safely ignore it. If it's not zero, then we can drop the lock and attempt to cancel from there. This also fixes a missing finish_wait() at the end of io_uring_cancel_files(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11	io_uring: flag SQPOLL busy condition to userspace	Jens Axboe
	Now that we have backpressure, for SQPOLL, we have one more condition that warrants flagging that the application needs to enter the kernel: we failed to submit IO due to backpressure. Make sure we catch that and flag it appropriately. If we run into backpressure issues with the SQPOLL thread, flag it as such to the application by setting IORING_SQ_NEED_WAKEUP. This will cause the application to enter the kernel, and that will flush the backlog and clear the condition. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11	io_uring: make ASYNC_CANCEL work with poll and timeout	Jens Axboe
	It's a little confusing that we have multiple types of command cancellation opcodes now that we have a generic one. Make the generic one work with POLL_ADD and TIMEOUT commands as well, that makes for an easier to use API for the application. The fact that they currently don't is a bit confusing. Add a helper that takes care of it, so we can user it from both IORING_OP_ASYNC_CANCEL and from the linked timeout cancellation. Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11	io_uring: provide fallback request for OOM situations	Jens Axboe
	One thing that really sucks for userspace APIs is if the kernel passes back -ENOMEM/-EAGAIN for resource shortages. The application really has no idea of what to do in those cases. Should it try and reap completions? Probably a good idea. Will it solve the issue? Who knows. This patch adds a simple fallback mechanism if we fail to allocate memory for a request. If we fail allocating memory from the slab for a request, we punt to a pre-allocated request. There's just one of these per io_ring_ctx, but the important part is if we ever return -EBUSY to the application, the applications knows that it can wait for events and make forward progress when events have completed. This is the important part. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11	iomap: fix return value of iomap_dio_bio_actor on 32bit systems	Jan Stancek
	Naresh reported LTP diotest4 failing for 32bit x86 and arm -next kernels on ext4. Same problem exists in 5.4-rc7 on xfs. The failure comes down to: openat(AT_FDCWD, "testdata-4.5918", O_RDWR\|O_DIRECT) = 4 mmap2(NULL, 4096, PROT_READ, MAP_PRIVATE\|MAP_ANONYMOUS, -1, 0) = 0xb7f7b000 read(4, 0xb7f7b000, 4096) = 0 // expects -EFAULT Problem is conversion at iomap_dio_bio_actor() return. Ternary operator has a return type and an attempt is made to convert each of operands to the type of the other. In this case "ret" (int) is converted to type of "copied" (unsigned long). Both have size of 4 bytes: size_t copied = 0; int ret = -14; long long actor_ret = copied ? copied : ret; On x86_64: actor_ret == -14; On x86 : actor_ret == 4294967282 Replace ternary operator with 2 return statements to avoid this unwanted conversion. Fixes: 4721a6010990 ("iomap: dio data corruption and spurious errors when pipes fill") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Jan Stancek <jstancek@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-11	Btrfs: fix log context list corruption after rename exchange operation	Filipe Manana
	During rename exchange we might have successfully log the new name in the source root's log tree, in which case we leave our log context (allocated on stack) in the root's list of log contextes. However we might fail to log the new name in the destination root, in which case we fallback to a transaction commit later and never sync the log of the source root, which causes the source root log context to remain in the list of log contextes. This later causes invalid memory accesses because the context was allocated on stack and after rename exchange finishes the stack gets reused and overwritten for other purposes. The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can detect this and report something like the following: [ 691.489929] ------------[ cut here ]------------ [ 691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38). [ 691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0 (...) [ 691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1 [ 691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [ 691.490003] RIP: 0010:__list_add_valid+0x95/0xe0 (...) [ 691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282 [ 691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000 [ 691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0 [ 691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114 [ 691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff [ 691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b [ 691.490019] FS: 00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000 [ 691.490021] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0 [ 691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 691.490029] Call Trace: [ 691.490058] btrfs_log_inode_parent+0x667/0x2730 [btrfs] [ 691.490083] ? join_transaction+0x24a/0xce0 [btrfs] [ 691.490107] ? btrfs_end_log_trans+0x80/0x80 [btrfs] [ 691.490111] ? dget_parent+0xb8/0x460 [ 691.490116] ? lock_downgrade+0x6b0/0x6b0 [ 691.490121] ? rwlock_bug.part.0+0x90/0x90 [ 691.490127] ? do_raw_spin_unlock+0x142/0x220 [ 691.490151] btrfs_log_dentry_safe+0x65/0x90 [btrfs] [ 691.490172] btrfs_sync_file+0x9f1/0xc00 [btrfs] [ 691.490195] ? btrfs_file_write_iter+0x1800/0x1800 [btrfs] [ 691.490198] ? rcu_read_lock_any_held.part.11+0x20/0x20 [ 691.490204] ? __do_sys_newstat+0x88/0xd0 [ 691.490207] ? cp_new_stat+0x5d0/0x5d0 [ 691.490218] ? do_fsync+0x38/0x60 [ 691.490220] do_fsync+0x38/0x60 [ 691.490224] __x64_sys_fdatasync+0x32/0x40 [ 691.490228] do_syscall_64+0x9f/0x540 [ 691.490233] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 691.490235] RIP: 0033:0x7f4b253ad5f0 (...) [ 691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b [ 691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0 [ 691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003 [ 691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c [ 691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4 [ 691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0 This started happening recently when running some test cases from fstests like btrfs/004 for example, because support for rename exchange was added last week to fsstress from fstests. So fix this by deleting the log context for the source root from the list if we have logged the new name in the source root. Reported-by: Su Yue <Damenly_Su@gmx.com> Fixes: d4682ba03ef618 ("Btrfs: sync log after logging new name") CC: stable@vger.kernel.org # 4.19+ Tested-by: Su Yue <Damenly_Su@gmx.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-11	fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned long	Konstantin Khlebnikov
	Quota statistics counted as 64-bit per-cpu counter. Reading sums per-cpu fractions as signed 64-bit int, filters negative values and then reports lower half as signed 32-bit int. Result may looks like: fs.quota.allocated_dquots = 22327 fs.quota.cache_hits = -489852115 fs.quota.drops = -487288718 fs.quota.free_dquots = 22083 fs.quota.lookups = -486883485 fs.quota.reads = 22327 fs.quota.syncs = 335064 fs.quota.writes = 3088689 Values bigger than 2^31-1 reported as negative. All counters except "allocated_dquots" and "free_dquots" are monotonic, thus they should be reported as is without filtering negative values. Kernel doesn't have generic helper for 64-bit sysctl yet, let's use at least unsigned long. Link: https://lore.kernel.org/r/157337934693.2078.9842146413181153727.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jan Kara <jack@suse.cz>
2019-11-11	Merge tag 'v5.4-rc7' into sched/core, to pick up fixes	Ingo Molnar
	Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-10	io_uring: convert accept4() -ERESTARTSYS into -EINTR	Jens Axboe
	If we cancel a pending accept operating with a signal, we get -ERESTARTSYS returned. Turn that into -EINTR for userspace, we should not be return -ERESTARTSYS. Fixes: 17f2fe35d080 ("io_uring: add support for IORING_OP_ACCEPT") Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10	io_uring: fix error clear of ->file_table in io_sqe_files_register()	Jens Axboe
	syzbot reports that when using failslab and friends, we can get a double free in io_sqe_files_unregister(): BUG: KASAN: double-free or invalid-free in io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185 CPU: 1 PID: 8819 Comm: syz-executor452 Not tainted 5.4.0-rc6-next-20191108 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x197/0x210 lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374 kasan_report_invalid_free+0x65/0xa0 mm/kasan/report.c:468 __kasan_slab_free+0x13a/0x150 mm/kasan/common.c:450 kasan_slab_free+0xe/0x10 mm/kasan/common.c:480 __cache_free mm/slab.c:3426 [inline] kfree+0x10a/0x2c0 mm/slab.c:3757 io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185 io_ring_ctx_free fs/io_uring.c:3998 [inline] io_ring_ctx_wait_and_kill+0x348/0x700 fs/io_uring.c:4060 io_uring_release+0x42/0x50 fs/io_uring.c:4068 __fput+0x2ff/0x890 fs/file_table.c:280 ____fput+0x16/0x20 fs/file_table.c:313 task_work_run+0x145/0x1c0 kernel/task_work.c:113 exit_task_work include/linux/task_work.h:22 [inline] do_exit+0x904/0x2e60 kernel/exit.c:817 do_group_exit+0x135/0x360 kernel/exit.c:921 __do_sys_exit_group kernel/exit.c:932 [inline] __se_sys_exit_group kernel/exit.c:930 [inline] __x64_sys_exit_group+0x44/0x50 kernel/exit.c:930 do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x43f2c8 Code: 31 b8 c5 f7 ff ff 48 8b 5c 24 28 48 8b 6c 24 30 4c 8b 64 24 38 4c 8b 6c 24 40 4c 8b 74 24 48 4c 8b 7c 24 50 48 83 c4 58 c3 66 <0f> 1f 84 00 00 00 00 00 48 8d 35 59 ca 00 00 0f b6 d2 48 89 fb 48 RSP: 002b:00007ffd5b976008 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043f2c8 RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000 RBP: 00000000004bf0a8 R08: 00000000000000e7 R09: ffffffffffffffd0 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001 R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000 This happens if we fail allocating the file tables. For that case we do free the file table correctly, but we forget to set it to NULL. This means that ring teardown will see it as being non-NULL, and attempt to free it again. Fix this by clearing the file_table pointer if we free the table. Reported-by: syzbot+3254bc44113ae1e331ee@syzkaller.appspotmail.com Fixes: 65e19f54d29c ("io_uring: support for larger fixed file sets") Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10	io_uring: separate the io_free_req and io_free_req_find_next interface	Jackie Liu
	Similar to the distinction between io_put_req and io_put_req_find_next, io_free_req has been modified similarly, with no functional changes. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10	io_uring: keep io_put_req only responsible for release and put req	Jackie Liu
	We already have io_put_req_find_next to find the next req of the link. we should not use the io_put_req function to find them. They should be functions of the same level. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>