summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2019-07-28xfs: fix stack contents leakage in the v1 inumber ioctlsDarrick J. Wong
Explicitly initialize the onstack structures to zero so we don't leak kernel memory into userspace when converting the in-core inumbers structure to the v1 inogrp ioctl structure. Add a comment about why we have to use memset to ensure that the padding holes in the structures are set to zero. Fixes: 5f19c7fc6873351 ("xfs: introduce v5 inode group structure") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2019-07-28fs-verity: add data verification hooks for ->readpages()Eric Biggers
Add functions that verify data pages that have been read from a fs-verity file, against that file's Merkle tree. These will be called from filesystems' ->readpage() and ->readpages() methods. Since data verification can block, a workqueue is provided for these methods to enqueue verification work from their bio completion callback. See the "Verifying data" section of Documentation/filesystems/fsverity.rst for more information. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-07-28fs-verity: add the hook for file ->setattr()Eric Biggers
Add a function fsverity_prepare_setattr() which filesystems that support fs-verity must call to deny truncates of verity files. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-07-28fs-verity: add the hook for file ->open()Eric Biggers
Add the fsverity_file_open() function, which prepares an fs-verity file to be read from. If not already done, it loads the fs-verity descriptor from the filesystem and sets up an fsverity_info structure for the inode which describes the Merkle tree and contains the file measurement. It also denies all attempts to open verity files for writing. This commit also begins the include/linux/fsverity.h header, which declares the interface between fs/verity/ and filesystems. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-07-28fs-verity: add Kconfig and the helper functions for hashingEric Biggers
Add the beginnings of the fs/verity/ support layer, including the Kconfig option and various helper functions for hashing. To start, only SHA-256 is supported, but other hash algorithms can easily be added. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-07-28Merge tag 'spdx-5.3-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx Pull SPDX fixes from Greg KH: "Here are some small SPDX fixes for 5.3-rc2 for things that came in during the 5.3-rc1 merge window that we previously missed. Only three small patches here: - two uapi patches to resolve some SPDX tags that were not correct - fix an invalid SPDX tag in the iomap Makefile file All have been properly reviewed on the public mailing lists" * tag 'spdx-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: iomap: fix Invalid License ID treewide: remove SPDX "WITH Linux-syscall-note" from kernel-space headers again treewide: add "WITH Linux-syscall-note" to SPDX tag of uapi headers
2019-07-27Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "Two fixes for the fair scheduling class: - Prevent freeing memory which is accessible by concurrent readers - Make the RCU annotations for numa groups consistent" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Use RCU accessors consistently for ->numa_group sched/fair: Don't free p->numa_faults with concurrent readers
2019-07-27Merge tag 'Wimplicit-fallthrough-5.3-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull Wimplicit-fallthrough enablement from Gustavo A. R. Silva: "This marks switch cases where we are expecting to fall through, and globally enables the -Wimplicit-fallthrough option in the main Makefile. Finally, some missing-break fixes that have been tagged for -stable: - drm/amdkfd: Fix missing break in switch statement - drm/amdgpu/gfx10: Fix missing break in switch statement With these changes, we completely get rid of all the fall-through warnings in the kernel" * tag 'Wimplicit-fallthrough-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: Makefile: Globally enable fall-through warning drm/i915: Mark expected switch fall-throughs drm/amd/display: Mark expected switch fall-throughs drm/amdkfd/kfd_mqd_manager_v10: Avoid fall-through warning drm/amdgpu/gfx10: Fix missing break in switch statement drm/amdkfd: Fix missing break in switch statement perf/x86/intel: Mark expected switch fall-throughs mtd: onenand_base: Mark expected switch fall-through afs: fsclient: Mark expected switch fall-throughs afs: yfsclient: Mark expected switch fall-throughs can: mark expected switch fall-throughs firewire: mark expected switch fall-throughs
2019-07-27autofs_lookup(): hold ->d_lock over playing with ->d_flagsAl Viro
... as well as setting ->d_fsdata, etc. Make all of that atomic. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-27get rid of autofs_info->active_countAl Viro
autofs_add_active() is always called only once (and on a dentry with freshly allocated ino, at that). autofs_del_active() is never called more than once. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-26f2fs: fix to read source block before invalidating itJaegeuk Kim
f2fs_allocate_data_block() invalidates old block address and enable new block address. Then, if we try to read old block by f2fs_submit_page_bio(), it will give WARN due to reading invalid blocks. Let's make the order sanely back. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-07-26Merge tag 'for-5.3-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two regression fixes: - hangs caused by a missing barrier in the locking code - memory leaks of extent_state due to bad handling of a cached pointer" * tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix extent_state leak in btrfs_lock_and_flush_ordered_range btrfs: Fix deadlock caused by missing memory barrier
2019-07-26Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs umount_tree() leak fix from Al Viro: "Fix braino introduced in 'switch the remnants of releasing the mountpoint away from fs_pin'. The most visible result is leaking struct mount when mounting btrfs, making it impossible to shut down" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix the struct mount leak in umount_tree()
2019-07-26Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - Several io_uring fixes/improvements: - Blocking fix for O_DIRECT (me) - Latter page slowness for registered buffers (me) - Fix poll hang under certain conditions (me) - Defer sequence check fix for wrapped rings (Zhengyuan) - Mismatch in async inc/dec accounting (Zhengyuan) - Memory ordering issue that could cause stall (Zhengyuan) - Track sequential defer in bytes, not pages (Zhengyuan) - NVMe pull request from Christoph - Set of hang fixes for wbt (Josef) - Redundant error message kill for libahci (Ding) - Remove unused blk_mq_sched_started_request() and related ops (Marcos) - drbd dynamic alloc shash descriptor to reduce stack use (Arnd) - blkcg ->pd_stat() non-debug print (Tejun) - bcache memory leak fix (Wei) - Comment fix (Akinobu) - BFQ perf regression fix (Paolo) * tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits) io_uring: ensure ->list is initialized for poll commands Revert "nvme-pci: don't create a read hctx mapping without read queues" nvme: fix multipath crash when ANA is deactivated nvme: fix memory leak caused by incorrect subsystem free nvme: ignore subnqn for ADATA SX6000LNP drbd: dynamically allocate shash descriptor block: blk-mq: Remove blk_mq_sched_started_request and started_request bcache: fix possible memory leak in bch_cached_dev_run() io_uring: track io length in async_list based on bytes io_uring: don't use iov_iter_advance() for fixed buffers block: properly handle IOCB_NOWAIT for async O_DIRECT IO blk-mq: allow REQ_NOWAIT to return an error inline io_uring: add a memory barrier before atomic_read rq-qos: use a mb for got_token rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule rq-qos: don't reset has_sleepers on spurious wakeups rq-qos: fix missed wake-ups in rq_qos_throttle wait: add wq_has_single_sleeper helper block, bfq: check also in-flight I/O in dispatch plugging block: fix sysfs module parameters directory path in comment ...
2019-07-26fix the struct mount leak in umount_tree()Al Viro
We need to drop everything we remove from the tree, whether mnt_has_parent() is true or not. Usually the bug manifests as a slow memory leak (leaked struct mount for initramfs); it becomes much more visible in mount_subtree() users, such as btrfs. There we leak a struct mount for btrfs superblock being mounted, which prevents fs shutdown on subsequent umount. Fixes: 56cbb429d911 ("switch the remnants of releasing the mountpoint away from fs_pin") Reported-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-26btrfs: fix extent_state leak in btrfs_lock_and_flush_ordered_rangeNaohiro Aota
btrfs_lock_and_flush_ordered_range() loads given "*cached_state" into cachedp, which, in general, is NULL. Then, lock_extent_bits() updates "cachedp", but it never goes backs to the caller. Thus the caller still see its "cached_state" to be NULL and never free the state allocated under btrfs_lock_and_flush_ordered_range(). As a result, we will see massive state leak with e.g. fstests btrfs/005. Fix this bug by properly handling the pointers. Fixes: bd80d94efb83 ("btrfs: Always use a cached extent_state in btrfs_lock_and_flush_ordered_range") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-25afs: fsclient: Mark expected switch fall-throughsGustavo A. R. Silva
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warnings: Warning level 3 was used: -Wimplicit-fallthrough=3 fs/afs/fsclient.c: In function ‘afs_deliver_fs_fetch_acl’: fs/afs/fsclient.c:2199:19: warning: this statement may fall through [-Wimplicit-fallthrough=] call->unmarshall++; ~~~~~~~~~~~~~~~~^~ fs/afs/fsclient.c:2202:2: note: here case 1: ^~~~ fs/afs/fsclient.c:2216:19: warning: this statement may fall through [-Wimplicit-fallthrough=] call->unmarshall++; ~~~~~~~~~~~~~~~~^~ fs/afs/fsclient.c:2219:2: note: here case 2: ^~~~ fs/afs/fsclient.c:2225:19: warning: this statement may fall through [-Wimplicit-fallthrough=] call->unmarshall++; ~~~~~~~~~~~~~~~~^~ fs/afs/fsclient.c:2228:2: note: here case 3: ^~~~ This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-07-25afs: yfsclient: Mark expected switch fall-throughsGustavo A. R. Silva
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warnings: fs/afs/yfsclient.c: In function ‘yfs_deliver_fs_fetch_opaque_acl’: fs/afs/yfsclient.c:1984:19: warning: this statement may fall through [-Wimplicit-fallthrough=] call->unmarshall++; ~~~~~~~~~~~~~~~~^~ fs/afs/yfsclient.c:1987:2: note: here case 1: ^~~~ fs/afs/yfsclient.c:2005:19: warning: this statement may fall through [-Wimplicit-fallthrough=] call->unmarshall++; ~~~~~~~~~~~~~~~~^~ fs/afs/yfsclient.c:2008:2: note: here case 2: ^~~~ fs/afs/yfsclient.c:2014:19: warning: this statement may fall through [-Wimplicit-fallthrough=] call->unmarshall++; ~~~~~~~~~~~~~~~~^~ fs/afs/yfsclient.c:2017:2: note: here case 3: ^~~~ fs/afs/yfsclient.c:2035:19: warning: this statement may fall through [-Wimplicit-fallthrough=] call->unmarshall++; ~~~~~~~~~~~~~~~~^~ fs/afs/yfsclient.c:2038:2: note: here case 4: ^~~~ fs/afs/yfsclient.c:2047:19: warning: this statement may fall through [-Wimplicit-fallthrough=] call->unmarshall++; ~~~~~~~~~~~~~~~~^~ fs/afs/yfsclient.c:2050:2: note: here case 5: ^~~~ Warning level 3 was used: -Wimplicit-fallthrough=3 Also, fix some commenting style issues. This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2019-07-25io_uring: ensure ->list is initialized for poll commandsJens Axboe
Daniel reports that when testing an http server that uses io_uring to poll for incoming connections, sometimes it hard crashes. This is due to an uninitialized list member for the io_uring request. Normally this doesn't trigger and none of the test cases caught it. Reported-by: Daniel Kozak <kozzi11@gmail.com> Tested-by: Daniel Kozak <kozzi11@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-25Merge branch 'access-creds'Linus Torvalds
The access() (and faccessat()) credentials change can cause an unnecessary load on the RCU machinery because every access() call ends up freeing the temporary access credential using RCU. This isn't really noticeable on small machines, but if you have hundreds of cores you can cause huge slowdowns due to RCU storms. It's easy to avoid: the temporary access crededntials aren't actually normally accessed using RCU at all, so we can avoid the whole issue by just marking them as such. * access-creds: access: avoid the RCU grace period for the temporary subjective credentials
2019-07-25btrfs: Fix deadlock caused by missing memory barrierNikolay Borisov
Commit 06297d8cefca ("btrfs: switch extent_buffer blocking_writers from atomic to int") changed the type of blocking_writers but forgot to adjust relevant code in btrfs_tree_unlock by converting the smp_mb__after_atomic to smp_mb. This opened up the possibility of a deadlock due to re-ordering of setting blocking_writers and checking/waking up the waiter. This particular lockup is explained in a comment above waitqueue_active() function. Fix it by converting the memory barrier to a full smp_mb, accounting for the fact that blocking_writers is a simple integer. Fixes: 06297d8cefca ("btrfs: switch extent_buffer blocking_writers from atomic to int") Tested-by: Johannes Thumshirn <jthumshirn@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-25sched/fair: Don't free p->numa_faults with concurrent readersJann Horn
When going through execve(), zero out the NUMA fault statistics instead of freeing them. During execve, the task is reachable through procfs and the scheduler. A concurrent /proc/*/sched reader can read data from a freed ->numa_faults allocation (confirmed by KASAN) and write it back to userspace. I believe that it would also be possible for a use-after-free read to occur through a race between a NUMA fault and execve(): task_numa_fault() can lead to task_numa_compare(), which invokes task_weight() on the currently running task of a different CPU. Another way to fix this would be to make ->numa_faults RCU-managed or add extra locking, but it seems easier to wipe the NUMA fault statistics on execve. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Fixes: 82727018b0d3 ("sched/numa: Call task_numa_free() from do_execve()") Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25fs: kernfs: Fix possible null-pointer dereferences in ↵Jia-Ju Bai
kernfs_path_from_node_locked() In kernfs_path_from_node_locked(), there is an if statement on line 147 to check whether buf is NULL: if (buf) When buf is NULL, it is used on line 151: len += strlcpy(buf + len, parent_str, ...) and line 158: len += strlcpy(buf + len, "/", ...) and line 160: len += strlcpy(buf + len, kn->name, ...) Thus, possible null-pointer dereferences may occur. To fix these possible bugs, buf is checked before being used. If it is NULL, -EINVAL is returned. These bugs are found by a static analysis tool STCheck written by us. Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Link: https://lore.kernel.org/r/20190724022242.27505-1-baijiaju1990@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-25kernfs: fix potential null pointer dereferencePeng Wang
Get root safely after kn is ensureed to be not null. Signed-off-by: Peng Wang <rocking@whu.edu.cn> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20190708151611.13242-1-rocking@whu.edu.cn Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-25locks: Fix procfs output for file leasesPavel Begunkov
Since commit 778fc546f749c588aa2f ("locks: fix tracking of inprogress lease breaks"), leases break don't change @fl_type but modifies @fl_flags. However, procfs's part haven't been updated. Previously, for a breaking lease the target type was printed (see target_leasetype()), as returns fcntl(F_GETLEASE). But now it's always "READ", as F_UNLCK no longer means "breaking". Unlike the previous one, this behaviour don't provide a complete description of the lease. There are /proc/pid/fdinfo/ outputs for a lease (the same for READ and WRITE) breaked by O_WRONLY. -- before: lock: 1: LEASE BREAKING READ 2558 08:03:815793 0 EOF -- after: lock: 1: LEASE BREAKING UNLCK 2558 08:03:815793 0 EOF Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2019-07-25iomap: fix Invalid License IDMasahiro Yamada
Detected by: $ ./scripts/spdxcheck.py fs/iomap/Makefile: 1:27 Invalid License ID: GPL-2.0-or-newer Fixes: 1c230208f53d ("iomap: start moving code to fs/iomap/") Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-24autofs: simplify get_next_positive_...(), get rid of trylocksAl Viro
* new helper: positive_after(parent, child); parent->d_lock is held by caller, grabs and returns the first thing after child in the list of children that has simple_positive() true. NULL if nothing's found; NULL child == search the entire list. * get_next_positive_subdir() loses the redundant check for d_count and switches to use of that helper. BTW, dput(NULL) is a no-op for a good reason... * get_next_positive_dentry() switched to the same helper. Logics: look for positive child in prev; if not found, look for the positive child of prev's parent following prev, etc. That way we are guaranteed that we are only moving rootwards through the ancestors of prev, which is pinned and thus not going anywhere. Since ->d_parent on autofs never changes, the same goes for the entire chain of ancestors and we don't need overlapping ->d_lock on them. Which avoids the trylock loops, in addition to simplifying the logics in there... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-24access: avoid the RCU grace period for the temporary subjective credentialsLinus Torvalds
It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU work because it installs a temporary credential that gets allocated and freed for each system call. The allocation and freeing overhead is mostly benign, but because credentials can be accessed under the RCU read lock, the freeing involves a RCU grace period. Which is not a huge deal normally, but if you have a lot of access() calls, this causes a fair amount of seconday damage: instead of having a nice alloc/free patterns that hits in hot per-CPU slab caches, you have all those delayed free's, and on big machines with hundreds of cores, the RCU overhead can end up being enormous. But it turns out that all of this is entirely unnecessary. Exactly because access() only installs the credential as the thread-local subjective credential, the temporary cred pointer doesn't actually need to be RCU free'd at all. Once we're done using it, we can just free it synchronously and avoid all the RCU overhead. So add a 'non_rcu' flag to 'struct cred', which can be set by users that know they only use it in non-RCU context (there are other potential users for this). We can make it a union with the rcu freeing list head that we need for the RCU case, so this doesn't need any extra storage. Note that this also makes 'get_current_cred()' clear the new non_rcu flag, in case we have filesystems that take a long-term reference to the cred and then expect the RCU delayed freeing afterwards. It's not entirely clear that this is required, but it makes for clear semantics: the subjective cred remains non-RCU as long as you only access it synchronously using the thread-local accessors, but you _can_ use it as a generic cred if you want to. It is possible that we should just remove the whole RCU markings for ->cred entirely. Only ->real_cred is really supposed to be accessed through RCU, and the long-term cred copies that nfs uses might want to explicitly re-enable RCU freeing if required, rather than have get_current_cred() do it implicitly. But this is a "minimal semantic changes" change for the immediate problem. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jan Glauber <jglauber@marvell.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com> Cc: Greg KH <greg@kroah.com> Cc: Kees Cook <keescook@chromium.org> Cc: David Howells <dhowells@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-22Merge tag 'for-5.3-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fixes for leaks caused by recently merged patches - one build fix - a fix to prevent mixing of incompatible features * tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: don't leak extent_map in btrfs_get_io_geometry() btrfs: free checksum hash on in close_ctree btrfs: Fix build error while LIBCRC32C is module btrfs: inode: Don't compress if NODATASUM or NODATACOW set
2019-07-21io_uring: track io length in async_list based on bytesZhengyuan Liu
We are using PAGE_SIZE as the unit to determine if the total len in async_list has exceeded max_pages, it's not fair for smaller io sizes. For example, if we are doing 1k-size io streams, we will never exceed max_pages since len >>= PAGE_SHIFT always gets zero. So use original bytes to make it more accurate. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-21io_uring: don't use iov_iter_advance() for fixed buffersJens Axboe
Hrvoje reports that when a large fixed buffer is registered and IO is being done to the latter pages of said buffer, the IO submission time is much worse: reading to the start of the buffer: 11238 ns reading to the end of the buffer: 1039879 ns In fact, it's worse by two orders of magnitude. The reason for that is how io_uring figures out how to setup the iov_iter. We point the iter at the first bvec, and then use iov_iter_advance() to fast-forward to the offset within that buffer we need. However, that is abysmally slow, as it entails iterating the bvecs that we setup as part of buffer registration. There's really no need to use this generic helper, as we know it's a BVEC type iterator, and we also know that each bvec is PAGE_SIZE in size, apart from possibly the first and last. Hence we can just use a shift on the offset to find the right index, and then adjust the iov_iter appropriately. After this fix, the timings are: reading to the start of the buffer: 10135 ns reading to the end of the buffer: 1377 ns Or about an 755x improvement for the tail page. Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-21block: properly handle IOCB_NOWAIT for async O_DIRECT IOJens Axboe
A caller is supposed to pass in REQ_NOWAIT if we can't block for any given operation, but O_DIRECT for block devices just ignore this. Hence we'll block for various resource shortages on the block layer side, like having to wait for requests. Use the new REQ_NOWAIT_INLINE to ask for this error to be returned inline, so we can handle it appropriately and return -EAGAIN to the caller. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-21audit_inode(): switch to passing AUDIT_INODE_...Al Viro
don't bother with remapping LOOKUP_... values - all callers pass constants and we can just as well pass the right ones from the very beginning. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-21filename_mountpoint(): make LOOKUP_NO_EVAL unconditional thereAl Viro
user_path_mountpoint_at() always gets it and the reasons to have it there (i.e. in umount(2)) apply to kern_path_mountpoint() callers as well. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-21filename_lookup(): audit_inode() argument is always 0Al Viro
We hadn't been passing LOOKUP_PARENT in flags to that thing since filename_parentat() had been split off back in 2015. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-21Merge tag '5.3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Two fixes for stable, one that had dependency on earlier patch in this merge window and can now go in, and a perf improvement in SMB3 open" * tag '5.3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: update internal module number cifs: flush before set-info if we have writeable handles smb3: optimize open to not send query file internal info cifs: copy_file_range needs to strip setuid bits and update timestamps CIFS: fix deadlock in cached root handling
2019-07-20Merge branch 'work.dcache2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache and mountpoint updates from Al Viro: "Saner handling of refcounts to mountpoints. Transfer the counting reference from struct mount ->mnt_mountpoint over to struct mountpoint ->m_dentry. That allows us to get rid of the convoluted games with ordering of mount shutdowns. The cost is in teaching shrink_dcache_{parent,for_umount} to cope with mixed-filesystem shrink lists, which we'll also need for the Slab Movable Objects patchset" * 'work.dcache2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: switch the remnants of releasing the mountpoint away from fs_pin get rid of detach_mnt() make struct mountpoint bear the dentry reference to mountpoint, not struct mount Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists fs/namespace.c: shift put_mountpoint() to callers of unhash_mnt() __detach_mounts(): lookup_mountpoint() can't return ERR_PTR() anymore nfs: dget_parent() never returns NULL ceph: don't open-code the check for dead lockref
2019-07-19Merge tag 'iomap-5.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap split/cleanup from Darrick Wong: "As promised, here's the second part of the iomap merge for 5.3, in which we break up iomap.c into smaller files grouped by functional area so that it'll be easier in the long run to maintain cohesiveness of code units and to review incoming patches. There are no functional changes and fs/iomap.c split cleanly. Summary: - Regroup the fs/iomap.c code by major functional area so that we can start development for 5.4 from a more stable base" * tag 'iomap-5.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: move internal declarations into fs/iomap/ iomap: move the main iteration code into a separate file iomap: move the buffered IO code into a separate file iomap: move the direct IO code into a separate file iomap: move the SEEK_HOLE code into a separate file iomap: move the file mapping reporting code into a separate file iomap: move the swapfile code into a separate file iomap: start moving code to fs/iomap/
2019-07-19Merge branch 'work.adfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull adfs updates from Al Viro: "More ADFS patches from Russell King" * 'work.adfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs/adfs: add time stamp and file type helpers fs/adfs: super: limit idlen according to directory type fs/adfs: super: fix use-after-free bug fs/adfs: super: safely update options on remount fs/adfs: super: correct superblock flags fs/adfs: clean up indirect disc addresses and fragment IDs fs/adfs: clean up error message printing fs/adfs: use %pV for error messages fs/adfs: use format_version from disc_record fs/adfs: add helper to get filesystem size fs/adfs: add helper to get discrecord from map fs/adfs: correct disc record structure
2019-07-19Merge branch 'work.mount0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs mount updates from Al Viro: "The first part of mount updates. Convert filesystems to use the new mount API" * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) mnt_init(): call shmem_init() unconditionally constify ksys_mount() string arguments don't bother with registering rootfs init_rootfs(): don't bother with init_ramfs_fs() vfs: Convert smackfs to use the new mount API vfs: Convert selinuxfs to use the new mount API vfs: Convert securityfs to use the new mount API vfs: Convert apparmorfs to use the new mount API vfs: Convert openpromfs to use the new mount API vfs: Convert xenfs to use the new mount API vfs: Convert gadgetfs to use the new mount API vfs: Convert oprofilefs to use the new mount API vfs: Convert ibmasmfs to use the new mount API vfs: Convert qib_fs/ipathfs to use the new mount API vfs: Convert efivarfs to use the new mount API vfs: Convert configfs to use the new mount API vfs: Convert binfmt_misc to use the new mount API convenience helper: get_tree_single() convenience helper get_tree_nodev() vfs: Kill sget_userns() ...
2019-07-19Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge yet more updates from Andrew Morton: "The rest of MM and a kernel-wide procfs cleanup. Summary of the more significant patches: - Patch series "mm/memory_hotplug: Factor out memory block devicehandling", v3. David Hildenbrand. Some spring-cleaning of the memory hotplug code, notably in drivers/base/memory.c - "mm: thp: fix false negative of shmem vma's THP eligibility". Yang Shi. Fix /proc/pid/smaps output for THP pages used in shmem. - "resource: fix locking in find_next_iomem_res()" + 1. Nadav Amit. Bugfix and speedup for kernel/resource.c - Patch series "mm: Further memory block device cleanups", David Hildenbrand. More spring-cleaning of the memory hotplug code. - Patch series "mm: Sub-section memory hotplug support". Dan Williams. Generalise the memory hotplug code so that pmem can use it more completely. Then remove the hacks from the libnvdimm code which were there to work around the memory-hotplug code's constraints. - "proc/sysctl: add shared variables for range check", Matteo Croce. We have about 250 instances of int zero; ... .extra1 = &zero, in the tree. This is a tree-wide sweep to make all those private "zero"s and "one"s use global variables. Alas, it isn't practical to make those two global integers const" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (38 commits) proc/sysctl: add shared variables for range check mm: migrate: remove unused mode argument mm/sparsemem: cleanup 'section number' data types libnvdimm/pfn: stop padding pmem namespaces to section alignment libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields mm/devm_memremap_pages: enable sub-section remap mm: document ZONE_DEVICE memory-model implications mm/sparsemem: support sub-section hotplug mm/sparsemem: prepare for sub-section ranges mm: kill is_dev_zone() helper mm/hotplug: kill is_dev_zone() usage in __remove_pages() mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap() mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal mm/sparsemem: add helpers track active portions of a section at boot mm/sparsemem: introduce a SECTION_IS_EARLY flag mm/sparsemem: introduce struct mem_section_usage drivers/base/memory.c: get rid of find_memory_block_hinted() mm/memory_hotplug: move and simplify walk_memory_blocks() mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns mm: make register_mem_sect_under_node() static ...
2019-07-18proc/sysctl: add shared variables for range checkMatteo Croce
In the sysctl code the proc_dointvec_minmax() function is often used to validate the user supplied value between an allowed range. This function uses the extra1 and extra2 members from struct ctl_table as minimum and maximum allowed value. On sysctl handler declaration, in every source file there are some readonly variables containing just an integer which address is assigned to the extra1 and extra2 members, so the sysctl range is enforced. The special values 0, 1 and INT_MAX are very often used as range boundary, leading duplication of variables like zero=0, one=1, int_max=INT_MAX in different source files: $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l 248 Add a const int array containing the most commonly used values, some macros to refer more easily to the correct array member, and use them instead of creating a local one for every object file. This is the bloat-o-meter output comparing the old and new binary compiled with the default Fedora config: # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164) Data old new delta sysctl_vals - 12 +12 __kstrtab_sysctl_vals - 12 +12 max 14 10 -4 int_max 16 - -16 one 68 - -68 zero 128 28 -100 Total: Before=20583249, After=20583085, chg -0.00% [mcroce@redhat.com: tipc: remove two unused variables] Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c] [arnd@arndb.de: proc/sysctl: make firmware loader table conditional] Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de [akpm@linux-foundation.org: fix fs/eventpoll.c] Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18mm: migrate: remove unused mode argumentKeith Busch
migrate_page_move_mapping() doesn't use the mode argument. Remove it and update callers accordingly. Link: http://lkml.kernel.org/r/20190508210301.8472-1-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18mm: thp: fix false negative of shmem vma's THP eligibilityYang Shi
Commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma") introduced THPeligible bit for processes' smaps. But, when checking the eligibility for shmem vma, __transparent_hugepage_enabled() is called to override the result from shmem_huge_enabled(). It may result in the anonymous vma's THP flag override shmem's. For example, running a simple test which create THP for shmem, but with anonymous THP disabled, when reading the process's smaps, it may show: 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test Size: 4096 kB ... [snip] ... ShmemPmdMapped: 4096 kB ... [snip] ... THPeligible: 0 And, /proc/meminfo does show THP allocated and PMD mapped too: ShmemHugePages: 4096 kB ShmemPmdMapped: 4096 kB This doesn't make too much sense. The shmem objects should be treated separately from anonymous THP. Calling shmem_huge_enabled() with checking MMF_DISABLE_THP sounds good enough. And, we could skip stack and dax vma check since we already checked if the vma is shmem already. Also check if vma is suitable for THP by calling transhuge_vma_suitable(). And minor fix to smaps output format and documentation. Link: http://lkml.kernel.org/r/1560401041-32207-3-git-send-email-yang.shi@linux.alibaba.com Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18cifs: update internal module numberSteve French
To 2.21 Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-18cifs: flush before set-info if we have writeable handlesRonnie Sahlberg
Servers can defer destaging any data and updating the mtime until close(). This means that if we do a setinfo to modify the mtime while other handles are open for write the server may overwrite our setinfo timestamps when if flushes the file on close() of the writeable handle. To solve this we add an explicit flush when the mtime is about to be updated. This fixes "cp -p" to preserve mtime when copying a file onto an SMB2 share. CC: Stable <stable@vger.kernel.org> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-18smb3: optimize open to not send query file internal infoSteve French
We can cut one third of the traffic on open by not querying the inode number explicitly via SMB3 query_info since it is now returned on open in the qfid context. This is better in multiple ways, and speeds up file open about 10% (more if network is slow). Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-18Merge tag 'nfs-for-5.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - SUNRPC: Ensure bvecs are re-synced when we re-encode the RPC request - Fix an Oops in ff_layout_track_ds_error due to a PTR_ERR() dereference - Revert buggy NFS readdirplus optimisation - NFSv4: Handle the special Linux file open access mode - pnfs: Fix a problem where we gratuitously start doing I/O through the MDS Features: - Allow NFS client to set up multiple TCP connections to the server using a new 'nconnect=X' mount option. Queue length is used to balance load. - Enhance statistics reporting to report on all transports when using multiple connections. - Speed up SUNRPC by removing bh-safe spinlocks - Add a mechanism to allow NFSv4 to request that containers set a unique per-host identifier for when the hostname is not set. - Ensure NFSv4 updates the lease_time after a clientid update Bugfixes and cleanup: - Fix use-after-free in rpcrdma_post_recvs - Fix a memory leak when nfs_match_client() is interrupted - Fix buggy file access checking in NFSv4 open for execute - disable unsupported client side deduplication - Fix spurious client disconnections - Fix occasional RDMA transport deadlock - Various RDMA cleanups - Various tracepoint fixes - Fix the TCP callback channel to guarantee the server can actually send the number of callback requests that was negotiated at mount time" * tag 'nfs-for-5.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits) pnfs/flexfiles: Add tracepoints for detecting pnfs fallback to MDS pnfs: Fix a problem where we gratuitously start doing I/O through the MDS SUNRPC: Optimise transport balancing code SUNRPC: Ensure the bvecs are reset when we re-encode the RPC request pnfs/flexfiles: Fix PTR_ERR() dereferences in ff_layout_track_ds_error NFSv4: Don't use the zero stateid with layoutget SUNRPC: Fix up backchannel slot table accounting SUNRPC: Fix initialisation of struct rpc_xprt_switch SUNRPC: Skip zero-refcount transports SUNRPC: Replace division by multiplication in calculation of queue length NFSv4: Validate the stateid before applying it to state recovery nfs4.0: Refetch lease_time after clientid update nfs4: Rename nfs41_setup_state_renewal nfs4: Make nfs4_proc_get_lease_time available for nfs4.0 nfs: Fix copy-and-paste error in debug message NFS: Replace 16 seq_printf() calls by seq_puts() NFS: Use seq_putc() in nfs_show_stats() Revert "NFS: readdirplus optimization by cache mechanism" (memleak) SUNRPC: Fix transport accounting when caller specifies an rpc_xprt NFS: Record task, client ID, and XID in xdr_status trace points ...
2019-07-18pnfs/flexfiles: Add tracepoints for detecting pnfs fallback to MDSTrond Myklebust
Add tracepoints to allow debugging of the event chain leading to a pnfs fallback to doing I/O through the MDS. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-18pnfs: Fix a problem where we gratuitously start doing I/O through the MDSTrond Myklebust
If the client has to stop in pnfs_update_layout() to wait for another layoutget to complete, it currently exits and defaults to I/O through the MDS if the layoutget was successful. Fixes: d03360aaf5cc ("pNFS: Ensure we return the error if someone kills...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.20+