From 278a5fbaed89dacd04e9d052f4594ffd0e0585de Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 24 May 2019 11:30:34 +0200 Subject: open: add close_range() This adds the close_range() syscall. It allows to efficiently close a range of file descriptors up to all file descriptors of a calling task. I was contacted by FreeBSD as they wanted to have the same close_range() syscall as we proposed here. We've coordinated this and in the meantime, Kyle was fast enough to merge close_range() into FreeBSD already in April: https://reviews.freebsd.org/D21627 https://svnweb.freebsd.org/base?view=revision&revision=359836 and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2]) once its merged in Linux too. Python is in the process of switching to close_range() on FreeBSD and they are waiting on us to merge this to switch on Linux as well: https://bugs.python.org/issue38061 The syscall came up in a recent discussion around the new mount API and making new file descriptor types cloexec by default. During this discussion, Al suggested the close_range() syscall (cf. [1]). Note, a syscall in this manner has been requested by various people over time. First, it helps to close all file descriptors of an exec()ing task. This can be done safely via (quoting Al's example from [1] verbatim): /* that exec is sensitive */ unshare(CLONE_FILES); /* we don't want anything past stderr here */ close_range(3, ~0U); execve(....); The code snippet above is one way of working around the problem that file descriptors are not cloexec by default. This is aggravated by the fact that we can't just switch them over without massively regressing userspace. For a whole class of programs having an in-kernel method of closing all file descriptors is very helpful (e.g. demons, service managers, programming language standard libraries, container managers etc.). (Please note, unshare(CLONE_FILES) should only be needed if the calling task is multi-threaded and shares the file descriptor table with another thread in which case two threads could race with one thread allocating file descriptors and the other one closing them via close_range(). For the general case close_range() before the execve() is sufficient.) Second, it allows userspace to avoid implementing closing all file descriptors by parsing through /proc//fd/* and calling close() on each file descriptor. From looking at various large(ish) userspace code bases this or similar patterns are very common in: - service managers (cf. [4]) - libcs (cf. [6]) - container runtimes (cf. [5]) - programming language runtimes/standard libraries - Python (cf. [2]) - Rust (cf. [7], [8]) As Dmitry pointed out there's even a long-standing glibc bug about missing kernel support for this task (cf. [3]). In addition, the syscall will also work for tasks that do not have procfs mounted and on kernels that do not have procfs support compiled in. In such situations the only way to make sure that all file descriptors are closed is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery (cf. comment [8] on Rust). The performance is striking. For good measure, comparing the following simple close_all_fds() userspace implementation that is essentially just glibc's version in [6]: static int close_all_fds(void) { int dir_fd; DIR *dir; struct dirent *direntp; dir = opendir("/proc/self/fd"); if (!dir) return -1; dir_fd = dirfd(dir); while ((direntp = readdir(dir))) { int fd; if (strcmp(direntp->d_name, ".") == 0) continue; if (strcmp(direntp->d_name, "..") == 0) continue; fd = atoi(direntp->d_name); if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2) continue; close(fd); } closedir(dir); return 0; } to close_range() yields: 1. closing 4 open files: - close_all_fds(): ~280 us - close_range(): ~24 us 2. closing 1000 open files: - close_all_fds(): ~5000 us - close_range(): ~800 us close_range() is designed to allow for some flexibility. Specifically, it does not simply always close all open file descriptors of a task. Instead, callers can specify an upper bound. This is e.g. useful for scenarios where specific file descriptors are created with well-known numbers that are supposed to be excluded from getting closed. For extra paranoia close_range() comes with a flags argument. This can e.g. be used to implement extension. Once can imagine userspace wanting to stop at the first error instead of ignoring errors under certain circumstances. There might be other valid ideas in the future. In any case, a flag argument doesn't hurt and keeps us on the safe side. From an implementation side this is kept rather dumb. It saw some input from David and Jann but all nonsense is obviously my own! - Errors to close file descriptors are currently ignored. (Could be changed by setting a flag in the future if needed.) - __close_range() is a rather simplistic wrapper around __close_fd(). My reasoning behind this is based on the nature of how __close_fd() needs to release an fd. But maybe I misunderstood specifics: We take the files_lock and rcu-dereference the fdtable of the calling task, we find the entry in the fdtable, get the file and need to release files_lock before calling filp_close(). In the meantime the fdtable might have been altered so we can't just retake the spinlock and keep the old rcu-reference of the fdtable around. Instead we need to grab a fresh reference to the fdtable. If my reasoning is correct then there's really no point in fancyfying __close_range(): We just need to rcu-dereference the fdtable of the calling task once to cap the max_fd value correctly and then go on calling __close_fd() in a loop. /* References */ [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/ [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220 [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7 [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217 [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236 [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17 Note that this is an internal implementation that is not exported. Currently, libc seems to not provide an exported version of this because of missing kernel support to do this. Note, in a recent patch series Florian made grantpt() a nop thereby removing the code referenced here. [7]: https://github.com/rust-lang/rust/issues/12148 [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308 Rust's solution is slightly different but is equally unperformant. Rust calls getdtablesize() which is a glibc library function that simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then goes on to call close() on each fd. That's obviously overkill for most tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or OPEN_MAX. Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set to 1024. Even in this case, there's a very high chance that in the common case Rust is calling the close() syscall 1021 times pointlessly if the task just has 0, 1, and 2 open. Suggested-by: Al Viro Signed-off-by: Christian Brauner Cc: Arnd Bergmann Cc: Kyle Evans Cc: Jann Horn Cc: David Howells Cc: Dmitry V. Levin Cc: Oleg Nesterov Cc: Linus Torvalds Cc: Florian Weimer Cc: linux-api@vger.kernel.org --- fs/open.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) (limited to 'fs/open.c') diff --git a/fs/open.c b/fs/open.c index 6cd48a61cda3..073ea3c45347 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1310,6 +1310,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd) return retval; } +/** + * close_range() - Close all file descriptors in a given range. + * + * @fd: starting file descriptor to close + * @max_fd: last file descriptor to close + * @flags: reserved for future extensions + * + * This closes a range of file descriptors. All file descriptors + * from @fd up to and including @max_fd are closed. + * Currently, errors to close a given file descriptor are ignored. + */ +SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd, + unsigned int, flags) +{ + if (flags) + return -EINVAL; + + return __close_range(current->files, fd, max_fd); +} + /* * This routine simulates a hangup on the tty, to arrange that users * are given clean terminals at login time. -- cgit From 60997c3d45d9a67daf01c56d805ae4fec37e0bd8 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Wed, 3 Jun 2020 21:48:55 +0200 Subject: close_range: add CLOSE_RANGE_UNSHARE One of the use-cases of close_range() is to drop file descriptors just before execve(). This would usually be expressed in the sequence: unshare(CLONE_FILES); close_range(3, ~0U); as pointed out by Linus it might be desirable to have this be a part of close_range() itself under a new flag CLOSE_RANGE_UNSHARE. This expands {dup,unshare)_fd() to take a max_fds argument that indicates the maximum number of file descriptors to copy from the old struct files. When the user requests that all file descriptors are supposed to be closed via close_range(min, max) then we can cap via unshare_fd(min) and hence don't need to do any of the heavy fput() work for everything above min. The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in fact currently share our file descriptor table we create a new private copy. We then close all fds in the requested range and finally after we're done we install the new fd table. Suggested-by: Linus Torvalds Signed-off-by: Christian Brauner --- fs/open.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) (limited to 'fs/open.c') diff --git a/fs/open.c b/fs/open.c index 073ea3c45347..5e62f18adc5b 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1324,10 +1324,7 @@ SYSCALL_DEFINE1(close, unsigned int, fd) SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd, unsigned int, flags) { - if (flags) - return -EINVAL; - - return __close_range(current->files, fd, max_fd); + return __close_range(fd, max_fd, flags); } /* -- cgit From c04011fe8cbd80af1be6e12b53193bf3846750d7 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 14 Jul 2020 08:47:43 +0200 Subject: fs: add a vfs_fchown helper Add a helper for struct file based chown operations. To be used by the initramfs code soon. Signed-off-by: Christoph Hellwig Acked-by: Linus Torvalds --- fs/open.c | 29 +++++++++++++++++------------ 1 file changed, 17 insertions(+), 12 deletions(-) (limited to 'fs/open.c') diff --git a/fs/open.c b/fs/open.c index 6cd48a61cda3..103c66309bee 100644 --- a/fs/open.c +++ b/fs/open.c @@ -740,23 +740,28 @@ SYSCALL_DEFINE3(lchown, const char __user *, filename, uid_t, user, gid_t, group AT_SYMLINK_NOFOLLOW); } +int vfs_fchown(struct file *file, uid_t user, gid_t group) +{ + int error; + + error = mnt_want_write_file(file); + if (error) + return error; + audit_file(file); + error = chown_common(&file->f_path, user, group); + mnt_drop_write_file(file); + return error; +} + int ksys_fchown(unsigned int fd, uid_t user, gid_t group) { struct fd f = fdget(fd); int error = -EBADF; - if (!f.file) - goto out; - - error = mnt_want_write_file(f.file); - if (error) - goto out_fput; - audit_file(f.file); - error = chown_common(&f.file->f_path, user, group); - mnt_drop_write_file(f.file); -out_fput: - fdput(f); -out: + if (f.file) { + error = vfs_fchown(f.file, user, group); + fdput(f); + } return error; } -- cgit From 9e96c8c0e94eea2f69a9705f5d0f51928ea26c17 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 14 Jul 2020 08:55:05 +0200 Subject: fs: add a vfs_fchmod helper Add a helper for struct file based chmode operations. To be used by the initramfs code soon. Signed-off-by: Christoph Hellwig Acked-by: Linus Torvalds --- fs/open.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'fs/open.c') diff --git a/fs/open.c b/fs/open.c index 103c66309bee..75166f071d28 100644 --- a/fs/open.c +++ b/fs/open.c @@ -602,14 +602,19 @@ out_unlock: return error; } +int vfs_fchmod(struct file *file, umode_t mode) +{ + audit_file(file); + return chmod_common(&file->f_path, mode); +} + int ksys_fchmod(unsigned int fd, umode_t mode) { struct fd f = fdget(fd); int err = -EBADF; if (f.file) { - audit_file(f.file); - err = chmod_common(&f.file->f_path, mode); + err = vfs_fchmod(f.file, mode); fdput(f); } return err; -- cgit From 166e07c37c6417c9713666268fc0eb89a9ce48b9 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sat, 6 Jun 2020 15:03:21 +0200 Subject: fs: remove ksys_open Just open code it in the two callers. Signed-off-by: Christoph Hellwig Acked-by: Linus Torvalds --- fs/open.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) (limited to 'fs/open.c') diff --git a/fs/open.c b/fs/open.c index 75166f071d28..ab3671af8a97 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1208,7 +1208,9 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode) { - return ksys_open(filename, flags, mode); + if (force_o_largefile()) + flags |= O_LARGEFILE; + return do_sys_open(AT_FDCWD, filename, flags, mode); } SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags, @@ -1270,9 +1272,12 @@ COMPAT_SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, fla */ SYSCALL_DEFINE2(creat, const char __user *, pathname, umode_t, mode) { - return ksys_open(pathname, O_CREAT | O_WRONLY | O_TRUNC, mode); -} + int flags = O_CREAT | O_WRONLY | O_TRUNC; + if (force_o_largefile()) + flags |= O_LARGEFILE; + return do_sys_open(AT_FDCWD, pathname, flags, mode); +} #endif /* -- cgit From b25ba7c3c9acdc4cf69f5bd69989819cabfc4e3b Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 14 Jul 2020 08:59:57 +0200 Subject: fs: remove ksys_fchmod Fold it into the only remaining caller. Signed-off-by: Christoph Hellwig Acked-by: Linus Torvalds --- fs/open.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) (limited to 'fs/open.c') diff --git a/fs/open.c b/fs/open.c index ab3671af8a97..b316dd6a86a8 100644 --- a/fs/open.c +++ b/fs/open.c @@ -608,7 +608,7 @@ int vfs_fchmod(struct file *file, umode_t mode) return chmod_common(&file->f_path, mode); } -int ksys_fchmod(unsigned int fd, umode_t mode) +SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode) { struct fd f = fdget(fd); int err = -EBADF; @@ -620,11 +620,6 @@ int ksys_fchmod(unsigned int fd, umode_t mode) return err; } -SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode) -{ - return ksys_fchmod(fd, mode); -} - int do_fchmodat(int dfd, const char __user *filename, umode_t mode) { struct path path; -- cgit From db63f1e315384590b979f8f74abd1b5363b69894 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 22 Jul 2020 11:25:21 +0200 Subject: init: add an init_chdir helper Add a simple helper to chdir with a kernel space file name and switch the early init code over to it. Remove the now unused ksys_chdir. Signed-off-by: Christoph Hellwig --- fs/open.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) (limited to 'fs/open.c') diff --git a/fs/open.c b/fs/open.c index b316dd6a86a8..723e0ac89893 100644 --- a/fs/open.c +++ b/fs/open.c @@ -482,7 +482,7 @@ SYSCALL_DEFINE2(access, const char __user *, filename, int, mode) return do_faccessat(AT_FDCWD, filename, mode, 0); } -int ksys_chdir(const char __user *filename) +SYSCALL_DEFINE1(chdir, const char __user *, filename) { struct path path; int error; @@ -508,11 +508,6 @@ out: return error; } -SYSCALL_DEFINE1(chdir, const char __user *, filename) -{ - return ksys_chdir(filename); -} - SYSCALL_DEFINE1(fchdir, unsigned int, fd) { struct fd f = fdget_raw(fd); -- cgit From 4b7ca5014cbef51cdb99fd644eae4f3773747a05 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 22 Jul 2020 11:26:13 +0200 Subject: init: add an init_chroot helper Add a simple helper to chroot with a kernel space file name and switch the early init code over to it. Remove the now unused ksys_chroot. Signed-off-by: Christoph Hellwig --- fs/open.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) (limited to 'fs/open.c') diff --git a/fs/open.c b/fs/open.c index 723e0ac89893..f62f4752bb43 100644 --- a/fs/open.c +++ b/fs/open.c @@ -530,7 +530,7 @@ out: return error; } -int ksys_chroot(const char __user *filename) +SYSCALL_DEFINE1(chroot, const char __user *, filename) { struct path path; int error; @@ -563,11 +563,6 @@ out: return error; } -SYSCALL_DEFINE1(chroot, const char __user *, filename) -{ - return ksys_chroot(filename); -} - static int chmod_common(const struct path *path, umode_t mode) { struct inode *inode = path->dentry->d_inode; -- cgit From b873498f99c77e7b5be3aa5ffe9ca67437232fe0 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 22 Jul 2020 11:13:26 +0200 Subject: init: add an init_chown helper Add a simple helper to chown with a kernel space file name and switch the early init code over to it. Signed-off-by: Christoph Hellwig --- fs/open.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs/open.c') diff --git a/fs/open.c b/fs/open.c index f62f4752bb43..49960a1248f1 100644 --- a/fs/open.c +++ b/fs/open.c @@ -639,7 +639,7 @@ SYSCALL_DEFINE2(chmod, const char __user *, filename, umode_t, mode) return do_fchmodat(AT_FDCWD, filename, mode); } -static int chown_common(const struct path *path, uid_t user, gid_t group) +int chown_common(const struct path *path, uid_t user, gid_t group) { struct inode *inode = path->dentry->d_inode; struct inode *delegated_inode = NULL; -- cgit From 1097742efc643ffc667c5c6684635b2663145a7d Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 22 Jul 2020 11:41:02 +0200 Subject: init: add an init_chmod helper Add a simple helper to chmod with a kernel space file name and switch the early init code over to it. Signed-off-by: Christoph Hellwig --- fs/open.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'fs/open.c') diff --git a/fs/open.c b/fs/open.c index 49960a1248f1..7ba89eae46c5 100644 --- a/fs/open.c +++ b/fs/open.c @@ -563,7 +563,7 @@ out: return error; } -static int chmod_common(const struct path *path, umode_t mode) +int chmod_common(const struct path *path, umode_t mode) { struct inode *inode = path->dentry->d_inode; struct inode *delegated_inode = NULL; @@ -610,7 +610,7 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode) return err; } -int do_fchmodat(int dfd, const char __user *filename, umode_t mode) +static int do_fchmodat(int dfd, const char __user *filename, umode_t mode) { struct path path; int error; -- cgit From eb9d7d390e51108b4c6a9a7993ed9be92548c8f7 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 22 Jul 2020 11:14:02 +0200 Subject: init: add an init_eaccess helper Add a simple helper to check if a file exists based on kernel space file name and switch the early init code over to it. Note that this theoretically changes behavior as it always is based on the effective permissions. But during early init that doesn't make a difference. Signed-off-by: Christoph Hellwig --- fs/open.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs/open.c') diff --git a/fs/open.c b/fs/open.c index 7ba89eae46c5..aafecd1f7ba1 100644 --- a/fs/open.c +++ b/fs/open.c @@ -394,7 +394,7 @@ static const struct cred *access_override_creds(void) return old_cred; } -long do_faccessat(int dfd, const char __user *filename, int mode, int flags) +static long do_faccessat(int dfd, const char __user *filename, int mode, int flags) { struct path path; struct inode *inode; -- cgit From 633fb6ac39801514613fbe050db6abdc3fe744d5 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Tue, 11 Aug 2020 18:36:26 -0700 Subject: exec: move S_ISREG() check earlier The execve(2)/uselib(2) syscalls have always rejected non-regular files. Recently, it was noticed that a deadlock was introduced when trying to execute pipes, as the S_ISREG() test was happening too late. This was fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files during execve()"), but it was added after inode_permission() had already run, which meant LSMs could see bogus attempts to execute non-regular files. Move the test into the other inode type checks (which already look for other pathological conditions[1]). Since there is no need to use FMODE_EXEC while we still have access to "acc_mode", also switch the test to MAY_EXEC. Also include a comment with the redundant S_ISREG() checks at the end of execve(2)/uselib(2) to note that they are present to avoid any mistakes. My notes on the call path, and related arguments, checks, etc: do_open_execat() struct open_flags open_exec_flags = { .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC, .acc_mode = MAY_EXEC, ... do_filp_open(dfd, filename, open_flags) path_openat(nameidata, open_flags, flags) file = alloc_empty_file(open_flags, current_cred()); do_open(nameidata, file, open_flags) may_open(path, acc_mode, open_flag) /* new location of MAY_EXEC vs S_ISREG() test */ inode_permission(inode, MAY_OPEN | acc_mode) security_inode_permission(inode, acc_mode) vfs_open(path, file) do_dentry_open(file, path->dentry->d_inode, open) /* old location of FMODE_EXEC vs S_ISREG() test */ security_file_open(f) open() [1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/ Signed-off-by: Kees Cook Signed-off-by: Andrew Morton Cc: Aleksa Sarai Cc: Alexander Viro Cc: Christian Brauner Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Tetsuo Handa Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org Signed-off-by: Linus Torvalds --- fs/open.c | 6 ------ 1 file changed, 6 deletions(-) (limited to 'fs/open.c') diff --git a/fs/open.c b/fs/open.c index c80e9f497e9b..9af548fb841b 100644 --- a/fs/open.c +++ b/fs/open.c @@ -779,12 +779,6 @@ static int do_dentry_open(struct file *f, return 0; } - /* Any file opened for execve()/uselib() has to be a regular file. */ - if (unlikely(f->f_flags & FMODE_EXEC && !S_ISREG(inode->i_mode))) { - error = -EACCES; - goto cleanup_file; - } - if (f->f_mode & FMODE_WRITE && !special_file(inode->i_mode)) { error = get_write_access(inode); if (unlikely(error)) -- cgit