Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:
"Recently, we added the ability to list mounts in other mount
namespaces and the ability to retrieve namespace file descriptors
without having to go through procfs by deriving them from pidfds.
This extends nsfs in two ways:
(1) Add the ability to retrieve information about a mount namespace
via NS_MNT_GET_INFO.
This will return the mount namespace id and the number of mounts
currently in the mount namespace. The number of mounts can be
used to size the buffer that needs to be used for listmount() and
is in general useful without having to actually iterate through
all the mounts.
The structure is extensible.
(2) Add the ability to iterate through all mount namespaces over
which the caller holds privilege returning the file descriptor
for the next or previous mount namespace.
To retrieve a mount namespace the caller must be privileged wrt
to it's owning user namespace. This means that PID 1 on the host
can list all mounts in all mount namespaces or that a container
can list all mounts of its nested containers.
Optionally pass a structure for NS_MNT_GET_INFO with
NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount
namespace in one go.
(1) and (2) can be implemented for other namespace types easily.
Together with recent api additions this means one can iterate through
all mounts in all mount namespaces without ever touching procfs.
The commit message in 49224a345c48 ('Merge patch series "nsfs: iterate
through mount namespaces"') contains example code how to do this"
* tag 'vfs-6.12.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nsfs: iterate through mount namespaces
file: add fput() cleanup helper
fs: add put_mnt_ns() cleanup helper
fs: allow mount namespace fd
|
|
The kernel is writing an object of type __u64, so the ioctl has to be
defined to _IOR(NSIO, 0x5, __u64) instead of _IO(NSIO, 0x5).
Reported-by: Dmitry V. Levin <ldv@strace.io>
Link: https://lore.kernel.org/r/20240730164554.GA18486@altlinux.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
It is already possible to list mounts in other mount namespaces and to
retrieve namespace file descriptors without having to go through procfs
by deriving them from pidfds.
Augment these abilities by adding the ability to retrieve information
about a mount namespace via NS_MNT_GET_INFO. This will return the mount
namespace id and the number of mounts currently in the mount namespace.
The number of mounts can be used to size the buffer that needs to be
used for listmount() and is in general useful without having to actually
iterate through all the mounts. The structure is extensible.
And add the ability to iterate through all mount namespaces over which
the caller holds privilege returning the file descriptor for the next or
previous mount namespace.
To retrieve a mount namespace the caller must be privileged wrt to it's
owning user namespace. This means that PID 1 on the host can list all
mounts in all mount namespaces or that a container can list all mounts
of its nested containers.
Optionally pass a structure for NS_MNT_GET_INFO with
NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount namespace
in one go. Both ioctls can be implemented for other namespace types
easily.
Together with recent api additions this means one can iterate through
all mounts in all mount namespaces without ever touching procfs.
Link: https://lore.kernel.org/r/20240719-work-mount-namespace-v1-5-834113cab0d2@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull namespace-fs updates from Christian Brauner:
"This adds ioctls allowing to translate PIDs between PID namespaces.
The motivating use-case comes from LXCFS which is a tiny fuse
filesystem used to virtualize various aspects of procfs. LXCFS is run
on the host. The files and directories it creates can be bind-mounted
by e.g. a container at startup and mounted over the various procfs
files the container wishes to have virtualized.
When e.g. a read request for uptime is received, LXCFS will receive
the pid of the reader. In order to virtualize the corresponding read,
LXCFS needs to know the pid of the init process of the reader's pid
namespace.
In order to do this, LXCFS first needs to fork() two helper processes.
The first helper process setns() to the readers pid namespace. The
second helper process is needed to create a process that is a proper
member of the pid namespace.
The second helper process then creates a ucred message with ucred.pid
set to 1 and sends it back to LXCFS. The kernel will translate the
ucred.pid field to the corresponding pid number in LXCFS's pid
namespace. This way LXCFS can learn the init pid number of the
reader's pid namespace and can go on to virtualize.
Since these two forks() are costly LXCFS maintains an init pid cache
that caches a given pid for a fixed amount of time. The cache is
pruned during new read requests. However, even with the cache the hit
of the two forks() is singificant when a very large number of
containers are running.
So this adds a simple set of ioctls that let's a caller translate PIDs
from and into a given PID namespace. This significantly improves
performance with a very simple change.
To protect against races pidfds can be used to check whether the
process is still valid"
* tag 'vfs-6.11.nsfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nsfs: add pid translation ioctls
|
|
In order to utilize the listmount() and statmount() extensions that
allow us to call them on different namespaces we need a way to get the
mnt namespace id from user space. Add an ioctl to nsfs that will allow
us to extract the mnt namespace id in order to make these new extensions
usable.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/180449959d5a756af7306d6bda55f41b9d53e3cb.1719243756.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add ioctl()s to translate pids between pid namespaces.
LXCFS is a tiny fuse filesystem used to virtualize various aspects of
procfs. LXCFS is run on the host. The files and directories it creates
can be bind-mounted by e.g. a container at startup and mounted over the
various procfs files the container wishes to have virtualized. When e.g.
a read request for uptime is received, LXCFS will receive the pid of the
reader. In order to virtualize the corresponding read, LXCFS needs to
know the pid of the init process of the reader's pid namespace. In order
to do this, LXCFS first needs to fork() two helper processes. The first
helper process setns() to the readers pid namespace. The second helper
process is needed to create a process that is a proper member of the pid
namespace. The second helper process then creates a ucred message with
ucred.pid set to 1 and sends it back to LXCFS. The kernel will translate
the ucred.pid field to the corresponding pid number in LXCFS's pid
namespace. This way LXCFS can learn the init pid number of the reader's
pid namespace and can go on to virtualize. Since these two forks() are
costly LXCFS maintains an init pid cache that caches a given pid for a
fixed amount of time. The cache is pruned during new read requests.
However, even with the cache the hit of the two forks() is singificant
when a very large number of containers are running. With this simple
patch we add an ns ioctl that let's a caller retrieve the init pid nr of
a pid namespace through its pid namespace fd. This significantly
improves performance with a very simple change.
Support translation of pids and tgids. Other concepts can be added but
there are no obvious users for this right now.
To protect against races pidfds can be used to check whether the process
is still valid. If needed, this can also be extended to work on pidfds
directly.
Link: https://lore.kernel.org/r/20240619-work-ns_ioctl-v1-1-7c0097e6bb6b@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
license
Many user space API headers are missing licensing information, which
makes it hard for compliance tools to determine the correct license.
By default are files without license information under the default
license of the kernel, which is GPLV2. Marking them GPLV2 would exclude
them from being included in non GPLV2 code, which is obviously not
intended. The user space API headers fall under the syscall exception
which is in the kernels COPYING file:
NOTE! This copyright does *not* cover user programs that use kernel
services by normal system calls - this is merely considered normal use
of the kernel, and does *not* fall under the heading of "derived work".
otherwise syscall usage would not be possible.
Update the files which contain no license information with an SPDX
license identifier. The chosen identifier is 'GPL-2.0 WITH
Linux-syscall-note' which is the officially assigned identifier for the
Linux syscall exception. SPDX license identifiers are a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne. See the previous patch in this series for the
methodology of how this patch was researched.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
I'd like to write code that discovers the user namespace hierarchy on a
running system, and also shows who owns the various user namespaces.
Currently, there is no way of getting the owner UID of a user namespace.
Therefore, this patch adds a new NS_GET_CREATOR_UID ioctl() that fetches
the UID (as seen in the user namespace of the caller) of the creator of
the user namespace referred to by the specified file descriptor.
If the supplied file descriptor does not refer to a user namespace,
the operation fails with the error EINVAL. If the owner UID does
not have a mapping in the caller's user namespace return the
overflow UID as that appears easier to deal with in practice
in user-space applications.
-- EWB Changed the handling of unmapped UIDs from -EOVERFLOW
back to the overflow uid. Per conversation with
Michael Kerrisk after examining his test code.
Acked-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Michael Kerrisk <mtk-manpages@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
|
|
Linux 4.9 added two ioctl() operations that can be used to discover:
* the parental relationships for hierarchical namespaces (user and PID)
[NS_GET_PARENT]
* the user namespaces that owns a specified non-user-namespace
[NS_GET_USERNS]
For no good reason that I can glean, NS_GET_USERNS was made synonymous
with NS_GET_PARENT for user namespaces. It might have been better if
NS_GET_USERNS had returned an error if the supplied file descriptor
referred to a user namespace, since it suggests that the caller may be
confused. More particularly, if it had generated an error, then I wouldn't
need the new ioctl() operation proposed here. (On the other hand, what
I propose here may be more generally useful.)
I would like to write code that discovers namespace relationships for
the purpose of understanding the namespace setup on a running system.
In particular, given a file descriptor (or pathname) for a namespace,
N, I'd like to obtain the corresponding user namespace. Namespace N
might be a user namespace (in which case my code would just use N) or
a non-user namespace (in which case my code will use NS_GET_USERNS to
get the user namespace associated with N). The problem is that there
is no way to tell the difference by looking at the file descriptor
(and if I try to use NS_GET_USERNS on an N that is a user namespace, I
get the parent user namespace of N, which is not what I want).
This patch therefore adds a new ioctl(), NS_GET_NSTYPE, which, given
a file descriptor that refers to a user namespace, returns the
namespace type (one of the CLONE_NEW* constants).
Signed-off-by: Michael Kerrisk <mtk-manpages@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
|
|
Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships.
In a future we will use this interface to dump and restore nested
namespaces.
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
|
|
Each namespace has an owning user namespace and now there is not way
to discover these relationships.
Understending namespaces relationships allows to answer the question:
what capability does process X have to perform operations on a resource
governed by namespace Y?
After a long discussion, Eric W. Biederman proposed to use ioctl-s for
this purpose.
The NS_GET_USERNS ioctl returns a file descriptor to an owning user
namespace.
It returns EPERM if a target namespace is outside of a current user
namespace.
v2: rename parent to relative
v3: Add a missing mntput when returning -EAGAIN --EWB
Acked-by: Serge Hallyn <serge@hallyn.com>
Link: https://lkml.org/lkml/2016/7/6/158
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
|