summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2021-01-24namespace: only take read lock in do_reconfigure_mnt()Christian Brauner
do_reconfigure_mnt() used to take the down_write(&sb->s_umount) lock which seems unnecessary since we're not changing the superblock. We're only checking whether it is already read-only. Setting other mount attributes is protected by lock_mount_hash() afaict and not by s_umount. The history of down_write(&sb->s_umount) lock being taken when setting mount attributes dates back to the introduction of MNT_READONLY in [2]. This introduced the concept of having read-only mounts in contrast to just having a read-only superblock. When it got introduced it was simply plumbed into do_remount() which already took down_write(&sb->s_umount) because it was only used to actually change the superblock before [2]. Afaict, it would've already been possible back then to only use down_read(&sb->s_umount) for MS_BIND | MS_REMOUNT since actual mount options were protected by the vfsmount lock already. But that would've meant special casing the locking for MS_BIND | MS_REMOUNT in do_remount() which people might not have considered worth it. Then in [1] MS_BIND | MS_REMOUNT mount option changes were split out of do_remount() into do_reconfigure_mnt() but the down_write(&sb->s_umount) lock was simply copied over. Now that we have this be a separate helper only take the down_read(&sb->s_umount) lock since we're only interested in checking whether the super block is currently read-only and blocking any writers from changing it. Essentially, checking that the super block is read-only has the advantage that we can avoid having to go into the slowpath and through MNT_WRITE_HOLD and can simply set the read-only flag on the mount in set_mount_attributes(). [1]: commit 43f5e655eff7 ("vfs: Separate changing mount flags full remount") [2]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount") Link: https://lore.kernel.org/r/20210121131959.646623-32-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24mount: make {lock,unlock}_mount_hash() staticChristian Brauner
The lock_mount_hash() and unlock_mount_hash() helpers are never called outside a single file. Remove them from the header and make them static to reflect this fact. There's no need to have them callable from other places right now, as Christoph observed. Link: https://lore.kernel.org/r/20210121131959.646623-31-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namespace: take lock_mount_hash() directly when changing flagsChristian Brauner
Changing mount options always ends up taking lock_mount_hash() but when MNT_READONLY is requested and neither the mount nor the superblock are MNT_READONLY we end up taking the lock, dropping it, and retaking it to change the other mount attributes. Instead, let's acquire the lock once when changing the mount attributes. This simplifies the locking in these codepath, makes them easier to reason about and avoids having to reacquire the lock right after dropping it. Link: https://lore.kernel.org/r/20210121131959.646623-30-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24nfs: do not export idmapped mountsChristian Brauner
Prevent nfs from exporting idmapped mounts until we have ported it to support exporting idmapped mounts. Link: https://lore.kernel.org/linux-api/20210123130958.3t6kvgkl634njpsm@wittgenstein Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: "J. Bruce Fields" <bfields@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24overlayfs: do not mount on top of idmapped mountsChristian Brauner
Prevent overlayfs from being mounted on top of idmapped mounts. Stacking filesystems need to be prevented from being mounted on top of idmapped mounts until they have have been converted to handle this. Link: https://lore.kernel.org/r/20210121131959.646623-29-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24ecryptfs: do not mount on top of idmapped mountsChristian Brauner
Prevent ecryptfs from being mounted on top of idmapped mounts. Stacking filesystems need to be prevented from being mounted on top of idmapped mounts until they have have been converted to handle this. Link: https://lore.kernel.org/r/20210121131959.646623-28-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24ima: handle idmapped mountsChristian Brauner
IMA does sometimes access the inode's i_uid and compares it against the rules' fowner. Enable IMA to handle idmapped mounts by passing down the mount's user namespace. We simply make use of the helpers we introduced before. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-27-christian.brauner@ubuntu.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24fs: make helpers idmap mount awareChristian Brauner
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24exec: handle idmapped mountsChristian Brauner
When executing a setuid binary the kernel will verify in bprm_fill_uid() that the inode has a mapping in the caller's user namespace before setting the callers uid and gid. Let bprm_fill_uid() handle idmapped mounts. If the inode is accessed through an idmapped mount it is mapped according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-24-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24would_dump: handle idmapped mountsChristian Brauner
When determining whether or not to create a coredump the vfs will verify that the caller is privileged over the inode. Make the would_dump() helper handle idmapped mounts by passing down the mount's user namespace of the exec file. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-23-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24ioctl: handle idmapped mountsChristian Brauner
Enable generic ioctls to handle idmapped mounts by passing down the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-22-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24init: handle idmapped mountsChristian Brauner
Enable the init helpers to handle idmapped mounts by passing down the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-21-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24fcntl: handle idmapped mountsChristian Brauner
Enable the setfl() helper to handle idmapped mounts by passing down the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-20-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24utimes: handle idmapped mountsChristian Brauner
Enable the vfs_utimes() helper to handle idmapped mounts by passing down the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-19-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24open: handle idmapped mountsChristian Brauner
For core file operations such as changing directories or chrooting, determining file access, changing mode or ownership the vfs will verify that the caller is privileged over the inode. Extend the various helpers to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the permissions checks are identical to non-idmapped mounts. When changing file ownership we need to map the uid and gid from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-17-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: James Morris <jamorris@linux.microsoft.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24open: handle idmapped mounts in do_truncate()Christian Brauner
When truncating files the vfs will verify that the caller is privileged over the inode. Extend it to handle idmapped mounts. If the inode is accessed through an idmapped mount it is mapped according to the mount's user namespace. Afterwards the permissions checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-16-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namei: prepare for idmapped mountsChristian Brauner
The various vfs_*() helpers are called by filesystems or by the vfs itself to perform core operations such as create, link, mkdir, mknod, rename, rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace and pass it down. Afterwards the checks and operations are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namei: introduce struct renamedataChristian Brauner
In order to handle idmapped mounts we will extend the vfs rename helper to take two new arguments in follow up patches. Since this operations already takes a bunch of arguments add a simple struct renamedata and make the current helper use it before we extend it. Link: https://lore.kernel.org/r/20210121131959.646623-14-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namei: handle idmapped mounts in may_*() helpersChristian Brauner
The may_follow_link(), may_linkat(), may_lookup(), may_open(), may_o_create(), may_create_in_sticky(), may_delete(), and may_create() helpers determine whether the caller is privileged enough to perform the associated operations. Let them handle idmapped mounts by mapping the inode or fsids according to the mount's user namespace. Afterwards the checks are identical to non-idmapped inodes. The patch takes care to retrieve the mount's user namespace right before performing permission checks and passing it down into the fileystem so the user namespace can't change in between by someone idmapping a mount that is currently not idmapped. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24stat: handle idmapped mountsChristian Brauner
The generic_fillattr() helper fills in the basic attributes associated with an inode. Enable it to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace before we store the uid and gid. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24commoncap: handle idmapped mountsChristian Brauner
When interacting with user namespace and non-user namespace aware filesystem capabilities the vfs will perform various security checks to determine whether or not the filesystem capabilities can be used by the caller, whether they need to be removed and so on. The main infrastructure for this resides in the capability codepaths but they are called through the LSM security infrastructure even though they are not technically an LSM or optional. This extends the existing security hooks security_inode_removexattr(), security_inode_killpriv(), security_inode_getsecurity() to pass down the mount's user namespace and makes them aware of idmapped mounts. In order to actually get filesystem capabilities from disk the capability infrastructure exposes the get_vfs_caps_from_disk() helper. For user namespace aware filesystem capabilities a root uid is stored alongside the capabilities. In order to determine whether the caller can make use of the filesystem capability or whether it needs to be ignored it is translated according to the superblock's user namespace. If it can be translated to uid 0 according to that id mapping the caller can use the filesystem capabilities stored on disk. If we are accessing the inode that holds the filesystem capabilities through an idmapped mount we map the root uid according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts: reading filesystem caps from disk enforces that the root uid associated with the filesystem capability must have a mapping in the superblock's user namespace and that the caller is either in the same user namespace or is a descendant of the superblock's user namespace. For filesystems that are mountable inside user namespace the caller can just mount the filesystem and won't usually need to idmap it. If they do want to idmap it they can create an idmapped mount and mark it with a user namespace they created and which is thus a descendant of s_user_ns. For filesystems that are not mountable inside user namespaces the descendant rule is trivially true because the s_user_ns will be the initial user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-11-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24xattr: handle idmapped mountsTycho Andersen
When interacting with extended attributes the vfs verifies that the caller is privileged over the inode with which the extended attribute is associated. For posix access and posix default extended attributes a uid or gid can be stored on-disk. Let the functions handle posix extended attributes on idmapped mounts. If the inode is accessed through an idmapped mount we need to map it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. This has no effect for e.g. security xattrs since they don't store uids or gids and don't perform permission checks on them like posix acls do. Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24acl: handle idmapped mountsChristian Brauner
The posix acl permission checking helpers determine whether a caller is privileged over an inode according to the acls associated with the inode. Add helpers that make it possible to handle acls on idmapped mounts. The vfs and the filesystems targeted by this first iteration make use of posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to translate basic posix access and default permissions such as the ACL_USER and ACL_GROUP type according to the initial user namespace (or the superblock's user namespace) to and from the caller's current user namespace. Adapt these two helpers to handle idmapped mounts whereby we either map from or into the mount's user namespace depending on in which direction we're translating. Similarly, cap_convert_nscap() is used by the vfs to translate user namespace and non-user namespace aware filesystem capabilities from the superblock's user namespace to the caller's user namespace. Enable it to handle idmapped mounts by accounting for the mount's user namespace. In addition the fileystems targeted in the first iteration of this patch series make use of the posix_acl_chmod() and, posix_acl_update_mode() helpers. Both helpers perform permission checks on the target inode. Let them handle idmapped mounts. These two helpers are called when posix acls are set by the respective filesystems to handle this case we extend the ->set() method to take an additional user namespace argument to pass the mount's user namespace down. Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24attr: handle idmapped mountsChristian Brauner
When file attributes are changed most filesystems rely on the setattr_prepare(), setattr_copy(), and notify_change() helpers for initialization and permission checking. Let them handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Helpers that perform checks on the ia_uid and ia_gid fields in struct iattr assume that ia_uid and ia_gid are intended values and have already been mapped correctly at the userspace-kernelspace boundary as we already do today. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24inode: make init and permission helpers idmapped mount awareChristian Brauner
The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namei: make permission helpers idmapped mount awareChristian Brauner
The two helpers inode_permission() and generic_permission() are used by the vfs to perform basic permission checking by verifying that the caller is privileged over an inode. In order to handle idmapped mounts we extend the two helpers with an additional user namespace argument. On idmapped mounts the two helpers will make sure to map the inode according to the mount's user namespace and then peform identical permission checks to inode_permission() and generic_permission(). If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24capability: handle idmapped mountsChristian Brauner
In order to determine whether a caller holds privilege over a given inode the capability framework exposes the two helpers privileged_wrt_inode_uidgid() and capable_wrt_inode_uidgid(). The former verifies that the inode has a mapping in the caller's user namespace and the latter additionally verifies that the caller has the requested capability in their current user namespace. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped inodes. If the initial user namespace is passed all operations are a nop so non-idmapped mounts will not see a change in behavior. Link: https://lore.kernel.org/r/20210121131959.646623-5-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24fs: add file and path permissions helpersChristian Brauner
Add two simple helpers to check permissions on a file and path respectively and convert over some callers. It simplifies quite a few codepaths and also reduces the churn in later patches quite a bit. Christoph also correctly points out that this makes codepaths (e.g. ioctls) way easier to follow that would otherwise have to do more complex argument passing than necessary. Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24mount: attach mappings to mountsChristian Brauner
In order to support per-mount idmappings vfsmounts are marked with user namespaces. The idmapping of the user namespace will be used to map the ids of vfs objects when they are accessed through that mount. By default all vfsmounts are marked with the initial user namespace. The initial user namespace is used to indicate that a mount is not idmapped. All operations behave as before. Based on prior discussions we want to attach the whole user namespace and not just a dedicated idmapping struct. This allows us to reuse all the helpers that already exist for dealing with idmappings instead of introducing a whole new range of helpers. In addition, if we decide in the future that we are confident enough to enable unprivileged users to setup idmapped mounts the permission checking can take into account whether the caller is privileged in the user namespace the mount is currently marked with. Later patches enforce that once a mount has been idmapped it can't be remapped. This keeps permission checking and life-cycle management simple. Users wanting to change the idmapped can always create a new detached mount with a different idmapping. Add a new mnt_userns member to vfsmount and two simple helpers to retrieve the mnt_userns from vfsmounts and files. The idea to attach user namespaces to vfsmounts has been floated around in various forms at Linux Plumbers in ~2018 with the original idea tracing back to a discussion in 2017 at a conference in St. Petersburg between Christoph, Tycho, and myself. Link: https://lore.kernel.org/r/20210121131959.646623-2-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-23iomap: add a IOMAP_DIO_OVERWRITE_ONLY flagChristoph Hellwig
Add a flag to signal that only pure overwrites are allowed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-23iomap: pass a flags argument to iomap_dio_rwChristoph Hellwig
Pass a set of flags to iomap_dio_rw instead of the boolean wait_for_completion argument. The IOMAP_DIO_FORCE_WAIT flag replaces the wait_for_completion, but only needs to be passed when the iocb isn't synchronous to start with to simplify the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> [djwong: rework xfs_file.c so that we can push iomap changes separately] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-23iomap: rename the flags variable in __iomap_dio_rwChristoph Hellwig
Rename flags to iomap_flags to make the usage a little more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-23cifs: do not fail __smb_send_rqst if non-fatal signals are pendingRonnie Sahlberg
RHBZ 1848178 The original intent of returning an error in this function in the patch: "CIFS: Mask off signals when sending SMB packets" was to avoid interrupting packet send in the middle of sending the data (and thus breaking an SMB connection), but we also don't want to fail the request for non-fatal signals even before we have had a chance to try to send it (the reported problem could be reproduced e.g. by exiting a child process when the parent process was in the midst of calling futimens to update a file's timestamps). In addition, since the signal may remain pending when we enter the sending loop, we may end up not sending the whole packet before TCP buffers become full. In this case the code returns -EINTR but what we need here is to return -ERESTARTSYS instead to allow system calls to be restarted. Fixes: b30c74c73c78 ("CIFS: Mask off signals when sending SMB packets") Cc: stable@vger.kernel.org # v5.1+ Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-01-22xfs: reduce ilock acquisitions in xfs_file_fsyncChristoph Hellwig
If the inode is not pinned by the time fsync is called we don't need the ilock to protect against concurrent clearing of ili_fsync_fields as the inode won't need a log flush or clearing of these fields. Not taking the iolock allows for full concurrency of fsync and thus O_DSYNC completions with io_uring/aio write submissions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-01-22xfs: refactor xfs_file_fsyncChristoph Hellwig
Factor out the log syncing logic into two helpers to make the code easier to read and more maintainable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-01-22xfs: cover the log on freeze instead of cleaning itBrian Foster
Filesystem freeze cleans the log and immediately redirties it so log recovery runs if a crash occurs after the filesystem is frozen. Now that log quiesce covers the log, there is no need to clean the log and redirty it to trigger log recovery because covering has the same effect. Update xfs_fs_freeze() to quiesce (and thus cover) the log. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-01-22xfs: remove xfs_quiesce_attr()Brian Foster
xfs_quiesce_attr() is now a wrapper for xfs_log_clean(). Remove it and call xfs_log_clean() directly. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-01-22xfs: remove duplicate wq cancel and log force from attr quiesceBrian Foster
These two calls are repeated at the beginning of xfs_log_quiesce(). Drop them from xfs_quiesce_attr(). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-01-22xfs: fold sbcount quiesce logging into log coveringBrian Foster
xfs_log_sbcount() calls xfs_sync_sb() to sync superblock counters to disk when lazy superblock accounting is enabled. This occurs on unmount, freeze, and read-only (re)mount and ensures the final values are calculated and persisted to disk before each form of quiesce completes. Now that log covering occurs in all of these contexts and uses the same xfs_sync_sb() mechanism to update log state, there is no need to log the superblock separately for any reason. Update the log quiesce path to sync the superblock at least once for any mount where lazy superblock accounting is enabled. If the log is already covered, it will remain in the covered state. Otherwise, the next sync as part of the normal covering sequence will carry the associated superblock update with it. Remove xfs_log_sbcount() now that it is no longer needed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-01-22xfs: don't reset log idle state on covering checkpointsBrian Foster
Now that log covering occurs on quiesce, we'd like to reuse the underlying superblock sync for final superblock updates. This includes things like lazy superblock counter updates, log feature incompat bits in the future, etc. One quirk to this approach is that once the log is in the IDLE (i.e. already covered) state, any subsequent log write resets the state back to NEED. This means that a final superblock sync to an already covered log requires two more sb syncs to return the log back to IDLE again. For example, if a lazy superblock enabled filesystem is mount cycled without any modifications, the unmount path syncs the superblock once and writes an unmount record. With the desired log quiesce covering behavior, we sync the superblock three times at unmount time: once for the lazy superblock counter update and twice more to cover the log. By contrast, if the log is active or only partially covered at unmount time, a final superblock sync would doubly serve as the one or two remaining syncs required to cover the log. This duplicate covering sequence is unnecessary because the filesystem remains consistent if a crash occurs at any point. The superblock will either be recovered in the event of a crash or written back before the log is quiesced and potentially cleaned with an unmount record. Update the log covering state machine to remain in the IDLE state if additional covering checkpoints pass through the log. This facilitates final superblock updates (such as lazy superblock counters) via a single sb sync without losing covered status. This provides some consistency with the active and partially covered cases and also avoids harmless, but spurious checkpoints when quiescing the log. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-01-22xfs: cover the log during log quiesceBrian Foster
The log quiesce mechanism historically terminates by marking the log clean with an unmount record. The primary objective is to indicate that log recovery is no longer required after the quiesce has flushed all in-core changes and written back filesystem metadata. While this is perfectly fine, it is somewhat hacky as currently used in certain contexts. For example, filesystem freeze quiesces (i.e. cleans) the log and immediately redirties it with a dummy superblock transaction to ensure that log recovery runs in the event of a crash. While this functions correctly, cleaning the log from freeze context is clearly superfluous given the current redirtying behavior. Instead, the desired behavior can be achieved by simply covering the log. This effectively retires all on-disk log items from the active range of the log by issuing two synchronous and sequential dummy superblock update transactions that serve to update the on-disk log head and tail. The subtle difference is that the log technically remains dirty due to the lack of an unmount record, though recovery is effectively a no-op due to the content of the checkpoints being clean (i.e. the unmodified on-disk superblock). Log covering currently runs in the background and only triggers once the filesystem and log has idled. The purpose of the background mechanism is to prevent log recovery from replaying the most recently logged items long after those items may have been written back. In the quiesce path, the log has been deliberately idled by forcing the log and pushing the AIL until empty in a context where no further mutable filesystem operations are allowed. Therefore, we can cover the log as the final step in the log quiesce codepath to reflect that all previously active items have been successfully written back. This facilitates selective log covering from certain contexts (i.e. freeze) that only seek to quiesce, but not necessarily clean the log. Note that as a side effect of this change, log covering now occurs when cleaning the log as well. This is harmless, facilitates subsequent cleanups, and is mostly temporary as various operations switch to use explicit log covering. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-01-22xfs: separate log cleaning from log quiesceBrian Foster
Log quiesce is currently associated with cleaning the log, which is accomplished by writing an unmount record as the last step of the quiesce sequence. The quiesce codepath is a bit convoluted in this regard due to how it is reused from various contexts. In preparation to create separate log cleaning and log covering interfaces, lift the write of the unmount record into a new cleaning helper and call that wherever xfs_log_quiesce() is currently invoked. No functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-22xfs: lift writable fs check up into log worker taskBrian Foster
The log covering helper checks whether the filesystem is writable to determine whether to cover the log. The helper is currently only called from the background log worker. In preparation to reuse the helper from freezing contexts, lift the check into xfs_log_worker(). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-22xfs: sync lazy sb accounting on quiesce of read-only mountsBrian Foster
xfs_log_sbcount() syncs the superblock specifically to accumulate the in-core percpu superblock counters and commit them to disk. This is required to maintain filesystem consistency across quiesce (freeze, read-only mount/remount) or unmount when lazy superblock accounting is enabled because individual transactions do not update the superblock directly. This mechanism works as expected for writable mounts, but xfs_log_sbcount() skips the update for read-only mounts. Read-only mounts otherwise still allow log recovery and write out an unmount record during log quiesce. If a read-only mount performs log recovery, it can modify the in-core superblock counters and write an unmount record when the filesystem unmounts without ever syncing the in-core counters. This leaves the filesystem with a clean log but in an inconsistent state with regard to lazy sb counters. Update xfs_log_sbcount() to use the same logic xfs_log_unmount_write() uses to determine when to write an unmount record. This ensures that lazy accounting is always synced before the log is cleaned. Refactor this logic into a new helper to distinguish between a writable filesystem and a writable log. Specifically, the log is writable unless the filesystem is mounted with the norecovery mount option, the underlying log device is read-only, or the filesystem is shutdown. Drop the freeze state check because the update is already allowed during the freezing process and no context calls this function on an already frozen fs. Also, retain the shutdown check in xfs_log_unmount_write() to catch the case where the preceding log force might have triggered a shutdown. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-22xfs: set inode size after creating symlinkJeffrey Mitchell
When XFS creates a new symlink, it writes its size to disk but not to the VFS inode. This causes i_size_read() to return 0 for that symlink until it is re-read from disk, for example when the system is rebooted. I found this inconsistency while protecting directories with eCryptFS. The command "stat path/to/symlink/in/ecryptfs" will report "Size: 0" if the symlink was created after the last reboot on an XFS root. Call i_size_write() in xfs_symlink() Signed-off-by: Jeffrey Mitchell <jeffrey.mitchell@starlab.io> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-01-22xfs: don't drain buffer lru on freeze and read-only remountBrian Foster
xfs_buftarg_drain() is called from xfs_log_quiesce() to ensure the buffer cache is reclaimed during unmount. xfs_log_quiesce() is also called from xfs_quiesce_attr(), however, which means that cache state is completely drained for filesystem freeze and read-only remount. While technically harmless, this is unnecessarily heavyweight. Both freeze and read-only mounts allow reads and thus allow population of the buffer cache. Therefore, the transitional sequence in either case really only needs to quiesce outstanding writes to return the filesystem in a generally read-only state. Additionally, some users have reported that attempts to freeze a filesystem concurrent with a read-heavy workload causes the freeze process to stall for a significant amount of time. This occurs because, as mentioned above, the read workload repopulates the buffer LRU while the freeze task attempts to drain it. To improve this situation, replace the drain in xfs_log_quiesce() with a buffer I/O quiesce and lift the drain into the unmount path. This removes buffer LRU reclaim from freeze and read-only [re]mount, but ensures the LRU is still drained before the filesystem unmounts. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-22xfs: rename xfs_wait_buftarg() to xfs_buftarg_drain()Brian Foster
xfs_wait_buftarg() is vaguely named and somewhat overloaded. Its primary purpose is to reclaim all buffers from the provided buffer target LRU. In preparation to refactor xfs_wait_buftarg() into serialization and LRU draining components, rename the function and associated helpers to something more descriptive. This patch has no functional changes with the minor exception of renaming a tracepoint. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-22xfs: Fix assert failure in xfs_setattr_size()Yumei Huang
An assert failure is triggered by syzkaller test due to ATTR_KILL_PRIV is not cleared before xfs_setattr_size. As ATTR_KILL_PRIV is not checked/used by xfs_setattr_size, just remove it from the assert. Signed-off-by: Yumei Huang <yuhuang@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-22xfs: fix up non-directory creation in SGID directoriesChristoph Hellwig
XFS always inherits the SGID bit if it is set on the parent inode, while the generic inode_init_owner does not do this in a few cases where it can create a possible security problem, see commit 0fa3ecd87848 ("Fix up non-directory creation in SGID directories") for details. Switch XFS to use the generic helper for the normal path to fix this, just keeping the simple field inheritance open coded for the case of the non-sgid case with the bsdgrpid mount option. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-22xfs: remove a stale comment from xfs_file_aio_write_checks()Eric Biggers
The comment in xfs_file_aio_write_checks() about calling file_modified() after dropping the ilock doesn't make sense, because the code that unconditionally acquires and drops the ilock was removed by commit 467f78992a07 ("xfs: reduce ilock hold times in xfs_file_aio_write_checks"). Remove this outdated comment. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>