From 278a5fbaed89dacd04e9d052f4594ffd0e0585de Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 24 May 2019 11:30:34 +0200 Subject: open: add close_range() This adds the close_range() syscall. It allows to efficiently close a range of file descriptors up to all file descriptors of a calling task. I was contacted by FreeBSD as they wanted to have the same close_range() syscall as we proposed here. We've coordinated this and in the meantime, Kyle was fast enough to merge close_range() into FreeBSD already in April: https://reviews.freebsd.org/D21627 https://svnweb.freebsd.org/base?view=revision&revision=359836 and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2]) once its merged in Linux too. Python is in the process of switching to close_range() on FreeBSD and they are waiting on us to merge this to switch on Linux as well: https://bugs.python.org/issue38061 The syscall came up in a recent discussion around the new mount API and making new file descriptor types cloexec by default. During this discussion, Al suggested the close_range() syscall (cf. [1]). Note, a syscall in this manner has been requested by various people over time. First, it helps to close all file descriptors of an exec()ing task. This can be done safely via (quoting Al's example from [1] verbatim): /* that exec is sensitive */ unshare(CLONE_FILES); /* we don't want anything past stderr here */ close_range(3, ~0U); execve(....); The code snippet above is one way of working around the problem that file descriptors are not cloexec by default. This is aggravated by the fact that we can't just switch them over without massively regressing userspace. For a whole class of programs having an in-kernel method of closing all file descriptors is very helpful (e.g. demons, service managers, programming language standard libraries, container managers etc.). (Please note, unshare(CLONE_FILES) should only be needed if the calling task is multi-threaded and shares the file descriptor table with another thread in which case two threads could race with one thread allocating file descriptors and the other one closing them via close_range(). For the general case close_range() before the execve() is sufficient.) Second, it allows userspace to avoid implementing closing all file descriptors by parsing through /proc//fd/* and calling close() on each file descriptor. From looking at various large(ish) userspace code bases this or similar patterns are very common in: - service managers (cf. [4]) - libcs (cf. [6]) - container runtimes (cf. [5]) - programming language runtimes/standard libraries - Python (cf. [2]) - Rust (cf. [7], [8]) As Dmitry pointed out there's even a long-standing glibc bug about missing kernel support for this task (cf. [3]). In addition, the syscall will also work for tasks that do not have procfs mounted and on kernels that do not have procfs support compiled in. In such situations the only way to make sure that all file descriptors are closed is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery (cf. comment [8] on Rust). The performance is striking. For good measure, comparing the following simple close_all_fds() userspace implementation that is essentially just glibc's version in [6]: static int close_all_fds(void) { int dir_fd; DIR *dir; struct dirent *direntp; dir = opendir("/proc/self/fd"); if (!dir) return -1; dir_fd = dirfd(dir); while ((direntp = readdir(dir))) { int fd; if (strcmp(direntp->d_name, ".") == 0) continue; if (strcmp(direntp->d_name, "..") == 0) continue; fd = atoi(direntp->d_name); if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2) continue; close(fd); } closedir(dir); return 0; } to close_range() yields: 1. closing 4 open files: - close_all_fds(): ~280 us - close_range(): ~24 us 2. closing 1000 open files: - close_all_fds(): ~5000 us - close_range(): ~800 us close_range() is designed to allow for some flexibility. Specifically, it does not simply always close all open file descriptors of a task. Instead, callers can specify an upper bound. This is e.g. useful for scenarios where specific file descriptors are created with well-known numbers that are supposed to be excluded from getting closed. For extra paranoia close_range() comes with a flags argument. This can e.g. be used to implement extension. Once can imagine userspace wanting to stop at the first error instead of ignoring errors under certain circumstances. There might be other valid ideas in the future. In any case, a flag argument doesn't hurt and keeps us on the safe side. From an implementation side this is kept rather dumb. It saw some input from David and Jann but all nonsense is obviously my own! - Errors to close file descriptors are currently ignored. (Could be changed by setting a flag in the future if needed.) - __close_range() is a rather simplistic wrapper around __close_fd(). My reasoning behind this is based on the nature of how __close_fd() needs to release an fd. But maybe I misunderstood specifics: We take the files_lock and rcu-dereference the fdtable of the calling task, we find the entry in the fdtable, get the file and need to release files_lock before calling filp_close(). In the meantime the fdtable might have been altered so we can't just retake the spinlock and keep the old rcu-reference of the fdtable around. Instead we need to grab a fresh reference to the fdtable. If my reasoning is correct then there's really no point in fancyfying __close_range(): We just need to rcu-dereference the fdtable of the calling task once to cap the max_fd value correctly and then go on calling __close_fd() in a loop. /* References */ [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/ [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220 [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7 [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217 [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236 [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17 Note that this is an internal implementation that is not exported. Currently, libc seems to not provide an exported version of this because of missing kernel support to do this. Note, in a recent patch series Florian made grantpt() a nop thereby removing the code referenced here. [7]: https://github.com/rust-lang/rust/issues/12148 [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308 Rust's solution is slightly different but is equally unperformant. Rust calls getdtablesize() which is a glibc library function that simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then goes on to call close() on each fd. That's obviously overkill for most tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or OPEN_MAX. Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set to 1024. Even in this case, there's a very high chance that in the common case Rust is calling the close() syscall 1021 times pointlessly if the task just has 0, 1, and 2 open. Suggested-by: Al Viro Signed-off-by: Christian Brauner Cc: Arnd Bergmann Cc: Kyle Evans Cc: Jann Horn Cc: David Howells Cc: Dmitry V. Levin Cc: Oleg Nesterov Cc: Linus Torvalds Cc: Florian Weimer Cc: linux-api@vger.kernel.org --- fs/file.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 56 insertions(+), 8 deletions(-) (limited to 'fs/file.c') diff --git a/fs/file.c b/fs/file.c index abb8b7081d7a..1b8ff05e8311 100644 --- a/fs/file.c +++ b/fs/file.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include #include @@ -620,12 +621,9 @@ void fd_install(unsigned int fd, struct file *file) EXPORT_SYMBOL(fd_install); -/* - * The same warnings as for __alloc_fd()/__fd_install() apply here... - */ -int __close_fd(struct files_struct *files, unsigned fd) +static struct file *pick_file(struct files_struct *files, unsigned fd) { - struct file *file; + struct file *file = NULL; struct fdtable *fdt; spin_lock(&files->file_lock); @@ -637,15 +635,65 @@ int __close_fd(struct files_struct *files, unsigned fd) goto out_unlock; rcu_assign_pointer(fdt->fd[fd], NULL); __put_unused_fd(files, fd); - spin_unlock(&files->file_lock); - return filp_close(file, files); out_unlock: spin_unlock(&files->file_lock); - return -EBADF; + return file; +} + +/* + * The same warnings as for __alloc_fd()/__fd_install() apply here... + */ +int __close_fd(struct files_struct *files, unsigned fd) +{ + struct file *file; + + file = pick_file(files, fd); + if (!file) + return -EBADF; + + return filp_close(file, files); } EXPORT_SYMBOL(__close_fd); /* for ksys_close() */ +/** + * __close_range() - Close all file descriptors in a given range. + * + * @fd: starting file descriptor to close + * @max_fd: last file descriptor to close + * + * This closes a range of file descriptors. All file descriptors + * from @fd up to and including @max_fd are closed. + */ +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd) +{ + unsigned int cur_max; + + if (fd > max_fd) + return -EINVAL; + + rcu_read_lock(); + cur_max = files_fdtable(files)->max_fds; + rcu_read_unlock(); + + /* cap to last valid index into fdtable */ + cur_max--; + + max_fd = min(max_fd, cur_max); + while (fd <= max_fd) { + struct file *file; + + file = pick_file(files, fd++); + if (!file) + continue; + + filp_close(file, files); + cond_resched(); + } + + return 0; +} + /* * variant of __close_fd that gets a ref on the file for later fput. * The caller must ensure that filp_close() called on the file, and then -- cgit From 60997c3d45d9a67daf01c56d805ae4fec37e0bd8 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Wed, 3 Jun 2020 21:48:55 +0200 Subject: close_range: add CLOSE_RANGE_UNSHARE One of the use-cases of close_range() is to drop file descriptors just before execve(). This would usually be expressed in the sequence: unshare(CLONE_FILES); close_range(3, ~0U); as pointed out by Linus it might be desirable to have this be a part of close_range() itself under a new flag CLOSE_RANGE_UNSHARE. This expands {dup,unshare)_fd() to take a max_fds argument that indicates the maximum number of file descriptors to copy from the old struct files. When the user requests that all file descriptors are supposed to be closed via close_range(min, max) then we can cap via unshare_fd(min) and hence don't need to do any of the heavy fput() work for everything above min. The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in fact currently share our file descriptor table we create a new private copy. We then close all fds in the requested range and finally after we're done we install the new fd table. Suggested-by: Linus Torvalds Signed-off-by: Christian Brauner --- fs/file.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 58 insertions(+), 7 deletions(-) (limited to 'fs/file.c') diff --git a/fs/file.c b/fs/file.c index 1b8ff05e8311..340bc9569f9d 100644 --- a/fs/file.c +++ b/fs/file.c @@ -19,6 +19,7 @@ #include #include #include +#include unsigned int sysctl_nr_open __read_mostly = 1024*1024; unsigned int sysctl_nr_open_min = BITS_PER_LONG; @@ -265,12 +266,22 @@ static unsigned int count_open_files(struct fdtable *fdt) return i; } +static unsigned int sane_fdtable_size(struct fdtable *fdt, unsigned int max_fds) +{ + unsigned int count; + + count = count_open_files(fdt); + if (max_fds < NR_OPEN_DEFAULT) + max_fds = NR_OPEN_DEFAULT; + return min(count, max_fds); +} + /* * Allocate a new files structure and copy contents from the * passed in files structure. * errorp will be valid only when the returned files_struct is NULL. */ -struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) +struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int *errorp) { struct files_struct *newf; struct file **old_fds, **new_fds; @@ -297,7 +308,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) spin_lock(&oldf->file_lock); old_fdt = files_fdtable(oldf); - open_files = count_open_files(old_fdt); + open_files = sane_fdtable_size(old_fdt, max_fds); /* * Check whether we need to allocate a larger fd array and fd set. @@ -328,7 +339,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) */ spin_lock(&oldf->file_lock); old_fdt = files_fdtable(oldf); - open_files = count_open_files(old_fdt); + open_files = sane_fdtable_size(old_fdt, max_fds); } copy_fd_bitmaps(new_fdt, old_fdt, open_files); @@ -665,32 +676,72 @@ EXPORT_SYMBOL(__close_fd); /* for ksys_close() */ * This closes a range of file descriptors. All file descriptors * from @fd up to and including @max_fd are closed. */ -int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd) +int __close_range(unsigned fd, unsigned max_fd, unsigned int flags) { unsigned int cur_max; + struct task_struct *me = current; + struct files_struct *cur_fds = me->files, *fds = NULL; + + if (flags & ~CLOSE_RANGE_UNSHARE) + return -EINVAL; if (fd > max_fd) return -EINVAL; rcu_read_lock(); - cur_max = files_fdtable(files)->max_fds; + cur_max = files_fdtable(cur_fds)->max_fds; rcu_read_unlock(); /* cap to last valid index into fdtable */ cur_max--; + if (flags & CLOSE_RANGE_UNSHARE) { + int ret; + unsigned int max_unshare_fds = NR_OPEN_MAX; + + /* + * If the requested range is greater than the current maximum, + * we're closing everything so only copy all file descriptors + * beneath the lowest file descriptor. + */ + if (max_fd >= cur_max) + max_unshare_fds = fd; + + ret = unshare_fd(CLONE_FILES, max_unshare_fds, &fds); + if (ret) + return ret; + + /* + * We used to share our file descriptor table, and have now + * created a private one, make sure we're using it below. + */ + if (fds) + swap(cur_fds, fds); + } + max_fd = min(max_fd, cur_max); while (fd <= max_fd) { struct file *file; - file = pick_file(files, fd++); + file = pick_file(cur_fds, fd++); if (!file) continue; - filp_close(file, files); + filp_close(file, cur_fds); cond_resched(); } + if (fds) { + /* + * We're done closing the files we were supposed to. Time to install + * the new file descriptor table and drop the old one. + */ + task_lock(me); + me->files = cur_fds; + task_unlock(me); + put_files_struct(fds); + } + return 0; } -- cgit From 6659061045cc93f609e100b128f30581e5f012e9 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Wed, 10 Jun 2020 08:20:05 -0700 Subject: fs: Move __scm_install_fd() to __receive_fd() In preparation for users of the "install a received file" logic outside of net/ (pidfd and seccomp), relocate and rename __scm_install_fd() from net/core/scm.c to __receive_fd() in fs/file.c, and provide a wrapper named receive_fd_user(), as future patches will change the interface to __receive_fd(). Additionally add a comment to fd_install() as a counterpoint to how __receive_fd() interacts with fput(). Cc: Alexander Viro Cc: "David S. Miller" Cc: Jakub Kicinski Cc: Dmitry Kadashev Cc: Jens Axboe Cc: Arnd Bergmann Cc: Sargun Dhillon Cc: Ido Schimmel Cc: Ioana Ciornei Cc: linux-fsdevel@vger.kernel.org Cc: netdev@vger.kernel.org Reviewed-by: Sargun Dhillon Acked-by: Christian Brauner Signed-off-by: Kees Cook --- fs/file.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) (limited to 'fs/file.c') diff --git a/fs/file.c b/fs/file.c index abb8b7081d7a..0cd598cab476 100644 --- a/fs/file.c +++ b/fs/file.c @@ -18,6 +18,7 @@ #include #include #include +#include unsigned int sysctl_nr_open __read_mostly = 1024*1024; unsigned int sysctl_nr_open_min = BITS_PER_LONG; @@ -613,6 +614,10 @@ void __fd_install(struct files_struct *files, unsigned int fd, rcu_read_unlock_sched(); } +/* + * This consumes the "file" refcount, so callers should treat it + * as if they had called fput(file). + */ void fd_install(unsigned int fd, struct file *file) { __fd_install(current->files, fd, file); @@ -931,6 +936,46 @@ out_unlock: return err; } +/** + * __receive_fd() - Install received file into file descriptor table + * + * @file: struct file that was received from another process + * @ufd: __user pointer to write new fd number to + * @o_flags: the O_* flags to apply to the new fd entry + * + * Installs a received file into the file descriptor table, with appropriate + * checks and count updates. Writes the fd number to userspace. + * + * This helper handles its own reference counting of the incoming + * struct file. + * + * Returns -ve on error. + */ +int __receive_fd(struct file *file, int __user *ufd, unsigned int o_flags) +{ + int new_fd; + int error; + + error = security_file_receive(file); + if (error) + return error; + + new_fd = get_unused_fd_flags(o_flags); + if (new_fd < 0) + return new_fd; + + error = put_user(new_fd, ufd); + if (error) { + put_unused_fd(new_fd); + return error; + } + + /* Bump the sock usage counts, if any. */ + __receive_sock(file); + fd_install(new_fd, get_file(file)); + return 0; +} + static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags) { int err = -EBADF; -- cgit From deefa7f3505ae2fb6a7cb75f50134b65a1dd1494 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Wed, 10 Jun 2020 20:47:45 -0700 Subject: fs: Add receive_fd() wrapper for __receive_fd() For both pidfd and seccomp, the __user pointer is not used. Update __receive_fd() to make writing to ufd optional via a NULL check. However, for the receive_fd_user() wrapper, ufd is NULL checked so an -EFAULT can be returned to avoid changing the SCM_RIGHTS interface behavior. Add new wrapper receive_fd() for pidfd and seccomp that does not use the ufd argument. For the new helper, the allocated fd needs to be returned on success. Update the existing callers to handle it. Cc: Alexander Viro Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Sargun Dhillon Acked-by: Christian Brauner Signed-off-by: Kees Cook --- fs/file.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) (limited to 'fs/file.c') diff --git a/fs/file.c b/fs/file.c index 0cd598cab476..56d96d5c0c9f 100644 --- a/fs/file.c +++ b/fs/file.c @@ -944,12 +944,13 @@ out_unlock: * @o_flags: the O_* flags to apply to the new fd entry * * Installs a received file into the file descriptor table, with appropriate - * checks and count updates. Writes the fd number to userspace. + * checks and count updates. Optionally writes the fd number to userspace, if + * @ufd is non-NULL. * * This helper handles its own reference counting of the incoming * struct file. * - * Returns -ve on error. + * Returns newly install fd or -ve on error. */ int __receive_fd(struct file *file, int __user *ufd, unsigned int o_flags) { @@ -964,16 +965,18 @@ int __receive_fd(struct file *file, int __user *ufd, unsigned int o_flags) if (new_fd < 0) return new_fd; - error = put_user(new_fd, ufd); - if (error) { - put_unused_fd(new_fd); - return error; + if (ufd) { + error = put_user(new_fd, ufd); + if (error) { + put_unused_fd(new_fd); + return error; + } } /* Bump the sock usage counts, if any. */ __receive_sock(file); fd_install(new_fd, get_file(file)); - return 0; + return new_fd; } static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags) -- cgit From 173817151b15d5a72a9bef1d2df7e6e7f6750f2e Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Wed, 10 Jun 2020 08:46:58 -0700 Subject: fs: Expand __receive_fd() to accept existing fd Expand __receive_fd() with support for replace_fd() for the coming seccomp "addfd" ioctl(). Add new wrapper receive_fd_replace() for the new behavior and update existing wrappers to retain old behavior. Thanks to Colin Ian King for pointing out an uninitialized variable exposure in an earlier version of this patch. Cc: Alexander Viro Cc: Dmitry Kadashev Cc: Jens Axboe Cc: Arnd Bergmann Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Sargun Dhillon Acked-by: Christian Brauner Signed-off-by: Kees Cook --- fs/file.c | 25 +++++++++++++++++++------ 1 file changed, 19 insertions(+), 6 deletions(-) (limited to 'fs/file.c') diff --git a/fs/file.c b/fs/file.c index 56d96d5c0c9f..4fb111735d1d 100644 --- a/fs/file.c +++ b/fs/file.c @@ -939,6 +939,7 @@ out_unlock: /** * __receive_fd() - Install received file into file descriptor table * + * @fd: fd to install into (if negative, a new fd will be allocated) * @file: struct file that was received from another process * @ufd: __user pointer to write new fd number to * @o_flags: the O_* flags to apply to the new fd entry @@ -952,7 +953,7 @@ out_unlock: * * Returns newly install fd or -ve on error. */ -int __receive_fd(struct file *file, int __user *ufd, unsigned int o_flags) +int __receive_fd(int fd, struct file *file, int __user *ufd, unsigned int o_flags) { int new_fd; int error; @@ -961,21 +962,33 @@ int __receive_fd(struct file *file, int __user *ufd, unsigned int o_flags) if (error) return error; - new_fd = get_unused_fd_flags(o_flags); - if (new_fd < 0) - return new_fd; + if (fd < 0) { + new_fd = get_unused_fd_flags(o_flags); + if (new_fd < 0) + return new_fd; + } else { + new_fd = fd; + } if (ufd) { error = put_user(new_fd, ufd); if (error) { - put_unused_fd(new_fd); + if (fd < 0) + put_unused_fd(new_fd); return error; } } + if (fd < 0) { + fd_install(new_fd, get_file(file)); + } else { + error = replace_fd(new_fd, file, o_flags); + if (error) + return error; + } + /* Bump the sock usage counts, if any. */ __receive_sock(file); - fd_install(new_fd, get_file(file)); return new_fd; } -- cgit From bc1cd99a9ad7e2c768e7ea92ae9c6ad4a4e0f7f7 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 14 Jul 2020 08:58:49 +0200 Subject: fs: remove ksys_dup Fold it into the only remaining caller. Signed-off-by: Christoph Hellwig Acked-by: Linus Torvalds --- fs/file.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) (limited to 'fs/file.c') diff --git a/fs/file.c b/fs/file.c index abb8b7081d7a..85b7993165dd 100644 --- a/fs/file.c +++ b/fs/file.c @@ -985,7 +985,7 @@ SYSCALL_DEFINE2(dup2, unsigned int, oldfd, unsigned int, newfd) return ksys_dup3(oldfd, newfd, 0); } -int ksys_dup(unsigned int fildes) +SYSCALL_DEFINE1(dup, unsigned int, fildes) { int ret = -EBADF; struct file *file = fget_raw(fildes); @@ -1000,11 +1000,6 @@ int ksys_dup(unsigned int fildes) return ret; } -SYSCALL_DEFINE1(dup, unsigned int, fildes) -{ - return ksys_dup(fildes); -} - int f_dupfd(unsigned int from, struct file *file, unsigned flags) { int err; -- cgit